Nonparametric Statistics, Lecture 7
Chap 8: Nonparametric Density Estimation Techniques
Reference: Wang Xing (2009), Nonparametric Statistics (《非参数统计》), Tsinghua University Press
Lecturer: Wang Xing (王星); TA: Fan Chao (范超), School of Statistics, Renmin University of China
Office: Mingde Main Building, Room 1019; office phone: 82500167; course site: https://dm.ruc.edu.cn
December 24, 2014

Basic concepts
• Think about it: what is a density, and what is it good for?
– Uneven coloring may indicate an artificially ripened watermelon.
– Zipf's law: in a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table.
• A density reflects how evenly the values of a random variable are distributed. Imbalance is often the normal state of the world: the important words of a language are necessarily the most frequently used, and a distributional anomaly in food-safety monitoring may be a sign of risk.
• What methods are commonly used to estimate a density from data? Nonparametric density estimation: the histogram, Parzen windows, the kernel density estimator, multivariate density estimation, discriminant analysis.

Introduction
• Most parametric densities are unimodal (they have a single local maximum), while many practical problems involve multimodal densities.
• Nonparametric procedures impose only weak assumptions on the data structure.
• There are two common nonparametric density-estimation problems:
– estimating the likelihood P(x | ω_j);
– estimating the posterior probability directly.

Density estimation
– Basic idea: the probability that a vector x falls in a region R is
  P = \int_R p(x') \, dx'    (1)
If n samples are drawn independently and k of them fall in R, the ratio k/n is a good estimate of the probability P, and hence of the density p.
– If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write
  \int_R p(x') \, dx' \approx p(x) V    (4)
where x is a point within R and V is the volume enclosed by R.
– Combining equations (1) and (4) yields the histogram-type estimate
  \hat{p}_n(x) = \frac{k/n}{V}

Histogram
• Dissects the range of the data into bins of equal width along the horizontal axis.
• The vertical axis represents the frequency counts (or percents, proportions); bars represent the counts.
• Fewer bins yield a smoother histogram, but less detail about the distribution.
• Trade-off between smoothness and detail: we want to preserve as much detail as possible, but we do not want the graph to be so rough that its shape is difficult to discern.

Choosing the best bandwidth
• A bin that is too wide causes oversmoothing; a bin that is too narrow makes the estimate \hat{p}_n(x) = (k/n)/V unstable.

Theoretically optimal bandwidth for the histogram
• Theorem: for a histogram with bandwidth h, the risk under L2 loss is
  R(\hat{p}_n, p) \approx \frac{h^2}{12} \int (p'(u))^2 \, du + \frac{1}{nh}
• Minimizing this expression gives the ideal bandwidth
  h^* = \left( \frac{6}{\int (p'(u))^2 \, du} \right)^{1/3} n^{-1/3}
• Under this choice of bandwidth the risk is of order n^{-2/3}.
• Choosing the number of bins is equivalent to choosing the bandwidth.

Bias-variance decomposition
• Oversmoothing means the model bias is too large; undersmoothing means the model variance is too large.
• For any estimator \tilde\theta of \theta:
  MSE(\tilde\theta) = E(\tilde\theta - \theta)^2
                    = E(\tilde\theta - E(\tilde\theta) + E(\tilde\theta) - \theta)^2
                    = E(\tilde\theta - E(\tilde\theta))^2 + (E(\tilde\theta) - \theta)^2
                    = Var(\tilde\theta) + bias^2(\tilde\theta)
• Note that the MSE is closely related to the prediction error:
  E(Y_0 - x_0^T \tilde\theta)^2 = E(Y_0 - x_0^T \theta)^2 + E(x_0^T \theta - x_0^T \tilde\theta)^2 = \sigma^2 + MSE(x_0^T \tilde\theta)
• In practice, an approximate bandwidth is obtained by cross-validation.

Parzen windows (fixed V)
– The Parzen-window approach to estimating densities assumes that the
region R_n is a d-dimensional hypercube with volume V_n = h_n^d, where h_n is the length of an edge of R_n. Let \varphi(u) be the window function
  \varphi(u) = 1 if |u_j| \le 1/2 for j = 1, ..., d, and 0 otherwise.
Then \varphi((x - x_i)/h_n) equals 1 if x_i falls within the hypercube of volume V_n centered at x, and 0 otherwise.
– The number of samples in this hypercube is
  k_n = \sum_{i=1}^{n} \varphi\left( \frac{x - x_i}{h_n} \right)
– Substituting k_n into \hat{p}_n(x) = (k_n/n)/V_n, we obtain the estimate
  p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\left( \frac{x - x_i}{h_n} \right)
– p_n(x) estimates p(x) as an average of functions of x and of the samples x_i (i = 1, ..., n). These window functions can be quite general.
– Example: the behavior of the Parzen-window method in the case p(x) ~ N(0, 1). Let \varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2} and h_n = h_1/\sqrt{n} (n > 1), with h_1 a known parameter. Then
  p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n} \varphi\left( \frac{x - x_i}{h_n} \right)
is an average of normal densities centered at the samples x_i.

Kernel functions commonly used in R

Properties of the kernel estimator

Application: estimating the parameter distribution in quantile regression
• After fitting a quantile-regression model fit = rq(y ~ x), the command summary(fit, se = '...') reports the parameter estimates.
• The se option selects the estimation method; se = 'ker' uses kernel estimation:
  library(quantreg)
  fit1 <- rq(foodexp ~ income, data = engel)
  summary(fit1, se = "ker")
  summary(fit1, se = "boot")
  summary(fit1, se = "nid")
• Because the residual distribution is unknown, the matrix
  H_n(\tau) = \frac{1}{n} \sum_{i=1}^{n} f_i(\xi_i(\tau)) \, x_i x_i'
cannot be computed directly.
• Powell proposed the following estimator:
  \hat{H}_n = \frac{1}{2 c_n n} \sum_{i=1}^{n} I(|\hat{u}_i| \le c_n) \, x_i x_i'

The sm package: confidence envelopes

Multivariate density estimation (with a common h, or with h varying by coordinate); bivariate density estimation.
In-class exercise and discussion: the distribution of Beijing school-district housing prices, and density estimation of prices in the surrounding area.
Three-dimensional density estimation.

Discriminant analysis
– Classification example. In classifiers based on Parzen-window estimation:
• we estimate the density of each category and classify a test point by the label corresponding to the maximum posterior;
• the decision region of a Parzen-window classifier depends on the choice of window function.
– The sea bass / salmon example:
• decision rule with only the prior information: decide ω_1 if P(ω_1) > P(ω_2), otherwise decide ω_2;
• P(x | ω_1) and P(x | ω_2) describe the difference in lightness between the populations of sea bass and salmon.

Example: discriminant computation based on nonparametric density estimation (steps for solving a two-class problem)
• 1. Priors and loss matrix → compute the threshold.
• 2. Nonparametric estimates of the likelihood densities → form the decision rule.
• 3. Given a new point, evaluate and compare the decisions.
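The histogram-type estimate p̂_n(x) = (k/n)/V described above can be sketched in a few lines. This is a minimal illustration written in Python rather than the lecture's R; the helper name `histogram_density`, the Uniform(0, 1) sample, and the bin width 0.1 are choices made here, not part of the lecture:

```python
import random

def histogram_density(data, x, h):
    """Histogram estimate p_hat(x) = (k/n) / V, where V = h is the width
    of the bin containing x and k the number of samples falling in it."""
    n = len(data)
    left = h * (x // h)  # left edge of the bin containing x (bins anchored at 0)
    k = sum(1 for xi in data if left <= xi < left + h)
    return (k / n) / h

random.seed(1)
data = [random.random() for _ in range(10000)]  # Uniform(0, 1) sample
# For a Uniform(0, 1) density, the estimate should be close to 1 inside (0, 1)
# and exactly 0 in bins that contain no data.
est = histogram_density(data, 0.5, 0.1)
```

Choosing h smaller here would make each bin's count k small and the estimate unstable, which is exactly the bias-variance trade-off discussed above.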
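The N(0, 1) Parzen-window example, with a Gaussian window and shrinking width h_n = h_1/√n, follows directly from the formula p_n(x) = (1/n) Σ_i (1/h_n) φ((x − x_i)/h_n). A Python sketch (the choice h_1 = 1 and the sample size are arbitrary assumptions of this illustration):

```python
import math
import random

def parzen_gaussian(data, x, h):
    """p_n(x) = (1/n) * sum_i (1/h) * phi((x - x_i)/h),
    with the standard normal window phi(u) = exp(-u^2/2) / sqrt(2*pi)."""
    phi = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(phi((x - xi) / h) for xi in data) / (len(data) * h)

random.seed(7)
data = [random.gauss(0, 1) for _ in range(400)]  # sample from the true N(0, 1)
h_n = 1.0 / math.sqrt(len(data))                 # h_n = h1 / sqrt(n) with h1 = 1
# The estimate is an average of normal densities centered at the samples,
# so it should roughly track the true density, e.g. about 0.4 near x = 0.
est = parzen_gaussian(data, 0.0, h_n)
```

Because each window is itself a density, p_n integrates to one for any bandwidth; only its smoothness changes with h_n.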
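The decomposition MSE = Var + bias² can also be checked numerically. A small Monte Carlo sketch in Python; the deliberately biased shrunken-mean estimator and all constants are arbitrary choices for illustration:

```python
import random

# Check MSE(theta~) = Var(theta~) + bias(theta~)^2 for the biased
# estimator theta~ = 0.9 * sample mean, targeting theta = E(X) = 2.
random.seed(3)
theta = 2.0
estimates = []
for _ in range(5000):  # 5000 simulated samples of size 25
    xs = [random.gauss(theta, 1) for _ in range(25)]
    estimates.append(0.9 * sum(xs) / len(xs))

m = sum(estimates) / len(estimates)                       # Monte Carlo E(theta~)
mse = sum((t - theta) ** 2 for t in estimates) / len(estimates)
var = sum((t - m) ** 2 for t in estimates) / len(estimates)
bias = m - theta                                          # true bias is -0.2
gap = abs(mse - (var + bias ** 2))                        # identity => ~0
```

The identity holds exactly for the empirical moments, which mirrors the algebraic derivation in the slides: shrinking trades a little extra bias for lower variance.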
Bayes' rule
• Posterior, likelihood, evidence:
  P(\omega_j | x) = \frac{P(x | \omega_j) \, P(\omega_j)}{P(x)}
where, in the case of two categories, the evidence is
  P(x) = \sum_{j=1}^{2} P(x | \omega_j) P(\omega_j)
• Posterior = (Likelihood × Prior) / Evidence.

A more general reading of Bayes' formula
• Hypothesis space H = {H_1, ..., H_n}; sample and data (evidence) E:
  P(H_i | E) = \frac{P(E | H_i) \, P(H_i)}{P(E)}
• If we only want to pick the most likely hypothesis H*, we can drop P(E):
  P(H_i | E) \propto P(E | H_i) \, P(H_i)
i.e., the posterior probability of H_i is proportional to the likelihood of the data/evidence if H_i is true, times the prior probability of H_i.

Decision given the posterior probabilities
• For an observation x, decide that the true state of nature is ω_1 if P(ω_1 | x) > P(ω_2 | x), and ω_2 if P(ω_1 | x) < P(ω_2 | x).
• Therefore, whenever we observe a particular x, the probability of error is
  P(error | x) = P(ω_1 | x) if we decide ω_2;
  P(error | x) = P(ω_2 | x) if we decide ω_1.
• Minimizing the probability of error: decide ω_1 if P(ω_1 | x) > P(ω_2 | x), otherwise decide ω_2. Equivalently,
  P(error | x) = min[ P(ω_1 | x), P(ω_2 | x) ].
• With losses λ_ij (the loss incurred by deciding ω_i when the true state is ω_j), the preceding rule is equivalent to the following rule: if
  \frac{P(x | \omega_1)}{P(x | \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}
then take action α_1 (decide ω_1); otherwise take action α_2 (decide ω_2).
• Conclusion: the Bayes decision rule can be interpreted as follows: if the likelihood ratio exceeds a threshold that does not depend on the observation x, decide ω_1.

Example: discriminant computation based on nonparametric density estimation
• States: {ω_1, ω_2}; actions: α_1 (deciding ω_1), α_2 (deciding ω_2).
• Loss matrix:
  L = \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix}
• With these losses the rule becomes: take action α_1 if P(x | ω_1)/P(x | ω_2) > (1/2) · P(ω_2)/P(ω_1); otherwise take α_2.
• Figure: estimated distribution densities of the lightness of the two kinds of fish, with test points newpoint = 2 and newpoint = 0.1 and their assigned labels class = 1 and class = 2.

Requirements for this chapter
• Master the basic principles of density estimation.
• Master several methods for multivariate visualization and modeling.
• Understand applications of density estimation.
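The three steps of the worked example above (threshold from priors and losses, nonparametric likelihood estimates, decision for a new point) can be sketched end to end. Python is used for illustration; the simulated "lightness" samples, the bandwidth h = 0.3, the equal priors, and the helper names `kde` and `decide` are assumptions of this sketch, while the loss matrix L = [[0, 1], [2, 0]] is taken from the example:

```python
import math
import random

def kde(data, x, h):
    """Gaussian-kernel density estimate at x."""
    return sum(math.exp(-((x - xi) / h) ** 2 / 2) for xi in data) / (
        len(data) * h * math.sqrt(2 * math.pi))

# Step 1: priors and loss matrix -> threshold for the likelihood ratio.
p1, p2 = 0.5, 0.5
l11, l12, l21, l22 = 0.0, 1.0, 2.0, 0.0        # L = [[0, 1], [2, 0]]
threshold = (l12 - l22) / (l21 - l11) * (p2 / p1)  # here: 0.5

# Step 2: nonparametric likelihood estimates from a training sample per class.
random.seed(11)
class1 = [random.gauss(2.0, 0.5) for _ in range(200)]  # stand-in lightness, class 1
class2 = [random.gauss(0.0, 0.5) for _ in range(200)]  # stand-in lightness, class 2
h = 0.3

# Step 3: decide class 1 iff the estimated likelihood ratio exceeds the threshold.
def decide(x):
    ratio = kde(class1, x, h) / kde(class2, x, h)
    return 1 if ratio > threshold else 2
```

Here decide(2.0) falls to class 1 and decide(0.1) to class 2, since the threshold itself does not depend on the observation, only the estimated likelihood ratio does.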





