Nonparametric Statistics, Lecture 8 (非参数统计第八讲.pdf)

2016/12/21

Outline
Chapter 8: Nonparametric Regression
• Kernel smoothing regression
• Local polynomial regression
• Robust regression
• *K-nearest-neighbor regression
• *Orthogonal series regression
• *B-Spline

Reference: Wang Xing (王星), 2014, Nonparametric Statistics (非参数统计), chap. 8
Wang Xing, office phone: 86-10-82500167, email: wangxingwisdom@126.com

1. Nonparametric Regression
• The aim of a regression analysis is to produce a reasonable approximation to the unknown response function m, where for n data points $(X_i, Y_i)$ the relationship can be modeled as

$$Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \dots, n \tag{1}$$

  Parametric & partial parametric: $Y_t = m(X_t) + \sigma(X_t)\varepsilon_t$
• Unlike the parametric approach, where the function m is fully described by a finite set of parameters, nonparametric modeling accommodates a very flexible form of the regression curve: a highly adaptable form of regression.

Motivation
• It provides a versatile method of exploring a general relationship between variables and can be used to test for nonlinearity. It offers a richer view of how variables are related, including nonlinear structure.
• It gives predictions of observations yet to be made without reference to a fixed parametric model.
• It provides a tool for finding spurious observations by studying the influence of isolated points.
• It constitutes a flexible method of substituting for missing values or interpolating between adjacent X-values.

Basic Principle of Smoothing Regression
• A reasonable approximation to the regression curve m(x) is the mean of the response variables near a point x. This local averaging procedure can be defined as

$$\hat m(x) = n^{-1} \sum_{i=1}^{n} W_{ni}(x) Y_i \tag{2}$$

Every smoothing method to be described is of the form (2). For kernel smoothing the weights are

$$W_{hi}(x) = K_h(x - X_i) / \hat f_h(x) \tag{3}$$

where $\hat f_h(x) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i)$ and $K_h(u) = h^{-1} K(u/h)$.
Kernel smoothing describes the shape of the weight function $W_{ni}(x)$ by a density function K with a scale parameter h that adjusts the size and the form of the weights near x.
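The local average (2) with the kernel weights (3) can be written out directly. The following is only a minimal sketch in base R: the function names `epanechnikov` and `local_average`, the simulated data and the bandwidth h = 0.1 are illustrative assumptions, not part of the lecture. The Epanechnikov kernel is used because it is the kernel shown later in the lecture's figures.

```r
# Epanechnikov kernel: a density supported on [-1, 1] (illustrative choice).
epanechnikov <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Local average (2) with kernel weights (3) at a single point x.
# Returns NA when no observation lies within distance h of x.
local_average <- function(x, X, Y, h, K = epanechnikov) {
  Kh <- K((x - X) / h) / h      # K_h(x - X_i) = h^{-1} K((x - X_i) / h)
  fhat <- mean(Kh)              # f_hat_h(x) = n^{-1} sum_i K_h(x - X_i)
  if (fhat == 0) return(NA_real_)
  W <- Kh / fhat                # W_hi(x); these weights average to 1
  mean(W * Y)                   # n^{-1} sum_i W_hi(x) Y_i
}

# Simulated data, for illustration only (not from the lecture).
set.seed(1)
X <- sort(runif(200))
Y <- sin(2 * pi * X) + rnorm(200, sd = 0.3)
grid <- seq(0.05, 0.95, by = 0.05)
fit <- sapply(grid, local_average, X = X, Y = Y, h = 0.1)
```

Because the weights average to 1, a constant response is reproduced exactly, and changing h only changes how many neighbours contribute to each local mean.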
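The kernel K is a continuous, bounded and symmetric real function that integrates to 1.

Kernel Smoothing
• The Nadaraya-Watson estimator is defined by

$$\hat m_h(x) = \frac{n^{-1} \sum_{i=1}^{n} K_h(x - X_i) Y_i}{n^{-1} \sum_{i=1}^{n} K_h(x - X_i)} \tag{4}$$

• For the mean squared error $d_M(x, h) = E[\hat m_h(x) - m(x)]^2$, as $n \to \infty$, $h \to 0$, $nh \to \infty$, we have the following result:

$$d_M(x, h) \approx (nh)^{-1} \sigma^2 c_K + h^4 d_K^2 [m''(x)]^2 / 4 \tag{5}$$

  where $\sigma^2 = \mathrm{var}(\varepsilon_i)$, $c_K = \int K^2(u)\,du$, $d_K = \int u^2 K(u)\,du$.
• The first term of (5) is the variance and the second is the squared bias, so as h increases the bias increases while the variance decreases.
• In N-W estimation the choice of kernel has very little effect; the bandwidth matters much more.

Figure 2. The Epanechnikov kernel $K(u) = 0.75(1 - u^2) I(|u| \le 1)$.
Figure 3. The effective kernel weights $K_h(x - \cdot) / \hat f_h(x)$ for the food versus net income data set at x = 1 and x = 2.5, for h = 0.1 (label 1), h = 0.2 (label 2), h = 0.3 (label 3), with the Epanechnikov kernel.

Local Regression
• Local regression method: the amount of averaging is controlled by a smoothing parameter. The choice of smoothing parameter reflects the balance between bias and variance.
  • Around each local point $x_0$, take a neighborhood of span s = k/n.
  • Assign each point in the neighborhood a weight $K_{i0}$ according to its distance from $x_0$; points outside the neighborhood get weight 0.
  • Fit by weighted least squares, so that the estimated parameters satisfy $\min \sum_{i=1}^{n} K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2$.
  • Combine the fitted values at all points to form the prediction model.
  • When there are many predictors, consider selecting a subset of them for the local regression.
  • Keep the dimension at most 3 or 4; in high dimensions the stability of the model is limited by the sparsity of the training set.

[Figure: how the fitted pattern changes as the bandwidth varies]

2. Local Polynomial Regression
Recall the standard nonparametric model:

$$Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \dots, n \tag{1}$$

and the Taylor expansion of m around $x_0$:

$$m(x) = m(x_0) + m'(x_0)(x - x_0) + \frac{m''(x_0)}{2!}(x - x_0)^2 + \dots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p + O\big((x - x_0)^{p+1}\big)$$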
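The Taylor expansion above suggests that a weighted least-squares fit of a polynomial in $(x - x_0)$ recovers $m(x_0)$ and its derivatives from the fitted coefficients: the coefficient of $(x - x_0)^j$ estimates $m^{(j)}(x_0)/j!$. The following is a rough sketch of that idea in base R, assuming an Epanechnikov kernel with a fixed bandwidth h; the function name `local_poly`, the simulated data and the settings are illustrative assumptions rather than the lecture's own code.

```r
# Epanechnikov kernel (illustrative choice of weight function).
epanechnikov <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Local polynomial fit of order p at x0 by weighted least squares.
# Returns estimates of m(x0), m'(x0), ..., m^(p)(x0).
# Needs at least p + 1 distinct x values within distance h of x0,
# otherwise solve() fails on a singular system.
local_poly <- function(x0, x, y, h, p = 1, K = epanechnikov) {
  w <- K((x - x0) / h)                      # kernel weights, 0 outside the window
  X <- outer(x - x0, 0:p, "^")              # columns (x - x0)^j for j = 0, ..., p
  beta <- solve(crossprod(X * w, X),        # (X' W X)^{-1} X' W y
                crossprod(X * w, y))
  drop(beta) * factorial(0:p)               # m^(j)(x0) is estimated by j! * beta_j
}

# Simulated data, for illustration only (not from the lecture).
set.seed(2)
x <- sort(runif(300))
y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)
est <- local_poly(0.25, x, y, h = 0.15, p = 1)
# est[1] estimates m(0.25) = sin(pi / 2); est[2] estimates m'(0.25)
```

Centering the design at $x_0$ is the design choice that makes the intercept the fitted value at $x_0$; rescaling the kernel by $1/h$ would not change the minimizer, so it is omitted here.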
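Fit a local polynomial near the point to be estimated:

$$\min_{\beta} \sum_{t=1}^{n} \Big[ Y_t - \sum_{j=0}^{p} \beta_j (X_t - x_0)^j \Big]^2 K\Big(\frac{X_t - x_0}{h}\Big)$$

To carry out local polynomial estimation we need to choose the polynomial order p, the bandwidth h and the kernel function K. These choices are of course interrelated. When $h = \infty$, the local polynomial fit becomes a global polynomial fit, and the order p determines the complexity of the model.

In matrix form the local polynomial problem is

$$\min_{\beta} (y - X\beta)^T W (y - X\beta)$$

where the rows of X are $(1, X_t - x_0, \dots, (X_t - x_0)^p)$ and W is the diagonal matrix of kernel weights.

Unlike in parametric models, the complexity of a local polynomial fit is controlled by the bandwidth. Usually p is small, so the choice of p becomes less important. If the goal is to estimate $m^{(v)}$, then when $p - v$ is odd the local polynomial fit automatically corrects the boundary bias. Furthermore, when $p - v$ is odd, the order-p fit contains one more parameter than the order $p - 1$ fit but does not increase the variance of the estimate of $m^{(v)}$; this extra parameter creates an opportunity to reduce bias, especially in boundary regions. On the other hand, the choice of the bandwidth h plays an important role in local polynomial fitting. Too large a bandwidth oversmooths and produces excessive modeling bias, while too small a bandwidth undersmooths and yields a noisy estimate.

[Figure: results for different window widths in local regression]

3. Robust Regression: LOWESS (locally weighted scatterplot smoother)
• Basic idea: local linear estimation combined with robust weight smoothing (observations with large residuals are down-weighted).
• MAD: $\mathrm{MAD} = \mathrm{median}(|r_i - \mathrm{median}(r_i)|)$

    # TIME and LIBERAL are assumed to be numeric vectors already in the
    # workspace; the data set itself is not shown in the slides.

    # Step 1: defining the window width
    plot(TIME, LIBERAL, xlab="Time (in days)", ylab="Liberal Support", type='n', main="Defining the Window Width")
    ord <- order(TIME)
    time <- TIME[ord]
    Lib <- LIBERAL[ord]                  # responses in the same order as time
    x0 <- time[10]                       # focal point
    diffs <- abs(time - x0)
    which.diff <- sort(diffs)[16]        # half-width covering the 16 nearest points
    abline(v=c(x0-which.diff, x0+which.diff), lty=2)
    abline(v=x0)
    points(time[diffs > which.diff], Lib[diffs > which.diff], pch=16, cex=2, col=gray(.75))
    points(time[diffs <= which.diff], Lib[diffs <= which.diff], cex=2)
    x.n <- time[diffs <= which.diff]
    y.n <- Lib[diffs <= which.diff]
    text(locator(1), "Window Width")     # click on the plot to place the label

    # Step 2: applying the tricube weight
    # Tricube function
    tricube <- function(z) {
      ifelse(abs(z) < 1, (1 - (abs(z))^3)^3, 0)
    }
    # Bisquare weight
    bisquare <- function(z) {
      ifelse(abs(z) < 1, (1 - (abs(z))^2)^2, 0)
    }
    plot(range(TIME), c(0,1), xlab="Time (in days)", ylab="Tricube Weight", type='n', main="The Tricube Weight")
    abline(v=c(x0-which.diff, x0+which.diff), lty=2)
    abline(v=x0)
    xwts <- seq(x0-which.diff, x0+which.diff, len=250)
    lines(xwts, tricube((xwts-x0)/which.diff), lty=1, lwd=2)
    points(x.n, tricube((x.n - x0)/which.diff), cex=2)

    # Step 3: the local polynomial
    plot(TIME, LIBERAL, xlab="Time (in days)", ylab="Liberal Support", type='n', main="Local Linear Regression")
    abline(v=c(x0-which.diff, x0+which.diff), lty=2)
    abline(v=x0)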
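    points(x.n, y.n, cex=2)
    mod <- lm(y.n ~ x.n, weights=tricube((x.n-x0)/which.diff))
    abline(mod, lwd=2, col=1)            # base-R replacement for reg.line() from older car versions
    points(x0, predict(mod, data.frame(x.n=x0)), pch=16, cex=1.8)
    text(locator(1), "Fitted Value of Y at Focal X")

    # LOWESS fits with different spans f; iter=0 turns off the robustness iterations
    library(car)                         # for data sets
    data(Prestige)
    plot(prestige ~ income, data=Prestige, xlab="Average Income", ylab="Prestige")
    with(Prestige, lines(lowess(income, prestige, f=0.5, iter=0), lwd=2))
    with(Prestige, lines(lowess(income, prestige, f=0.8, iter=0), lwd=2, col=4))
    with(Prestige, lines(lowess(income, prestige, f=0.1, iter=0), lwd=2, col=6))

[Figure: ordinary local polynomial regression versus robust polynomial regression, and how each responds to outliers]

Case Study: Relationship Between NOx Emissions and Engine Performance

Data: the NOx emissions data set ethanol
• NOx: amount of NOx in the exhaust
• CompRatio: compression ratio of the engine
• EquivRatio: fuel-air equivalence ratio

Background: severe smog; policy response: reduce motor vehicle use.

Existing research
• Engine compression ratio: high-compression engines running at high temperature can develop slight knocking, which increases NOx emissions.
• Fuel-air equivalence ratio: when the fuel-to-air ratio is below 1 or close to 1, the air is not completely burned, which lowers combustion efficiency and produces more exhaust.

Questions
Is there a relationship?
• What effect do the two variables actually have on NOx?
• What is the pattern of the effect?
How is the relationship defined?
• How are the parameters of the model estimated?
• How is a stable relationship controlled through the parameters?

    NOx     CompRatio   EquivRatio
    3.741   12          0.907
    2.295   12          0.761
    1.498   12          1.108
    2.881   12          1.016
    0.76    12          1.189
    3.12     9          1.001
    0.638    9          1.231
    1.17     9          1.123
    2.358   12          1.042
    0.606   12          1.215

    library(locfit)                      # locfit(), lp(), gcv() and the ethanol data
    data(ethanol, package="locfit")      # columns NOx, C (compression ratio), E (equivalence ratio)

    # cross-validation
    alpha <- seq(0.2, 1, by=0.02)        # candidate nearest-neighbor fractions
    g <- matrix(nrow=length(alpha), ncol=4)
    for (k in seq_along(alpha)) {
      g[k,] <- gcv(NOx~lp(E, nn=alpha[k]), data=ethanol)
    }
    plot(g[,4]~g[,3], ylab="GCV", xlab="degrees of freedom")
    f1 <- locfit(NOx~lp(E, nn=0.3), data=ethanol)
    plot(f1)

Scatterplot and local linear model

    plot(NOx~C, data=ethanol)
    fit <- locfit(NOx~lp(E, nn=0.5), data=ethanol)
    plot(NOx~E, data=ethanol)            # plot(E, NOx, data=ethanol) cannot find E and NOx
    lines(fit)

    # two predictors: compression ratio and equivalence ratio
    fit1 <- locfit(NOx~lp(C, E, nn=0.3, scale=0), data=ethanol)
    plot(fit1)

High-displacement car engines: incomplete combustion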
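Looking back at the LOWESS idea in Section 3, the down-weighting of large residuals can be sketched in base R. This is a simplified illustration, not the algorithm behind `lowess()`: the function name `robust_smooth`, the cut-off of 6 times the MAD, the span and the simulated data are all assumptions made for the example. It combines the tricube distance weights, the bisquare robustness weights and the MAD defined in Section 3.

```r
tricube  <- function(z) ifelse(abs(z) < 1, (1 - abs(z)^3)^3, 0)
bisquare <- function(z) ifelse(abs(z) < 1, (1 - z^2)^2, 0)

# Local linear smoother with optional robustness passes.
# span: fraction of points in each window; iter: number of reweighting passes.
robust_smooth <- function(x, y, span = 0.5, iter = 2) {
  n <- length(x)
  k <- max(3, ceiling(span * n))          # number of points per window
  rw <- rep(1, n)                         # robustness weights, all 1 at first
  fitted <- numeric(n)
  for (pass in 0:iter) {
    for (i in seq_len(n)) {
      d <- abs(x - x[i])
      h <- sort(d)[k]                     # distance to the k-th nearest point
      w <- tricube(d / h) * rw            # distance weight times robustness weight
      xc <- x - x[i]                      # centre so the intercept is the fit at x[i]
      fitted[i] <- coef(lm(y ~ xc, weights = w))[[1]]
    }
    r <- y - fitted
    s <- median(abs(r - median(r)))       # MAD of the residuals
    if (s == 0) break                     # perfect fit: nothing to down-weight
    rw <- bisquare(r / (6 * s))           # assumed cut-off: 6 * MAD
  }
  fitted
}

# Simulated straight line with one gross outlier (illustration only).
set.seed(3)
x <- seq(0, 1, length.out = 60)
truth <- 1 + 2 * x
y <- truth + rnorm(60, sd = 0.05)
y[30] <- y[30] + 5
plain  <- robust_smooth(x, y, iter = 0)   # ordinary local linear fit
robust <- robust_smooth(x, y, iter = 2)   # after two reweighting passes
```

With `iter = 0` the outlier pulls the fitted curve toward itself; after the reweighting passes its robustness weight drops to zero and the fit near it follows the remaining points. In the `lowess()` calls above, the `iter` argument plays the corresponding role.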