非参数统计第八讲.pdf (Nonparametric Statistics, Lecture 8)
2016/12/21

Outline
Chapter 8: Nonparametric Regression
• Kernel smoothing regression
• Local polynomial regression
• Robust regression
• *K-nearest-neighbor regression
• *Orthogonal series regression
• *B-Spline
Reference: 王星 (Wang Xing), 2014, 非参数统计 (Nonparametric Statistics), chap. 8
王星 (Wang Xing)
Office phone: 86-10-82500167
Email: wangxingwisdom@126.com

1. Nonparametric regression
• The aim of a regression analysis is to produce a reasonable approximation of the unknown response function m, where for n data points $(X_i, Y_i)$ the relationship can be modeled as
  $Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \dots, n$  (1)
  Parametric & partially parametric: $Y_t = m(X_t) + \sigma(X_t)\varepsilon_t$
• Unlike the parametric approach, where the function m is fully described by a finite set of parameters, nonparametric modeling accommodates a very flexible form of the regression curve (a highly adaptable form of regression).

Motivation
• It provides a versatile method of exploring a general relationship between variables and can be used to test for nonlinearity. (It offers a richer view of how variables relate, including nonlinear structure.)
• It gives predictions of observations yet to be made without reference to a fixed parametric model. (Predictions do not require a fixed parametric form.)
• It provides a tool for finding spurious observations by studying the influence of isolated points. (It offers a way to detect outlying observations and study their possible influence.)
• It constitutes a flexible method of substituting for missing values or interpolating between adjacent X-values. (It adapts well when data are missing or when missing values must be interpolated from neighboring points.)

Basic principle of smoothing regression
• A reasonable approximation to the regression curve m(x) is the mean of the response variables near a point x. This local averaging procedure can be defined as
  $\hat{m}(x) = n^{-1} \sum_{i=1}^{n} W_{ni}(x) Y_i$  (2)
  Every smoothing method to be described is of the form (2). For kernel smoothing, the weights are
  $W_{hi}(x) = K_h(x - X_i) / \hat{f}_h(x)$  (3)
  where $\hat{f}_h(x) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i)$ and $K_h(u) = h^{-1} K(u/h)$.
  Kernel smoothing describes the shape of the weight function $W_{ni}(x)$ by a density function K with a scale parameter that adjusts the size and the form of the weights near x.
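The local average (2) with the kernel weights (3) can be sketched in a few lines of R. This is only an illustrative sketch: the Epanechnikov kernel is assumed as one common choice of K, the helper names `epan`, `k_h` and `kernel_smooth` are hypothetical, and the data are simulated rather than taken from the lecture.

```r
# Epanechnikov kernel: one common choice of K (an assumption for this sketch)
epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Scaled kernel K_h(u) = K(u / h) / h
k_h <- function(u, h) epan(u / h) / h

# Local average (2) with kernel weights (3), evaluated at a single point x0
kernel_smooth <- function(x0, x, y, h) {
  kh <- k_h(x0 - x, h)
  f_hat <- mean(kh)   # f_hat_h(x0) = n^{-1} * sum_i K_h(x0 - X_i)
  w <- kh / f_hat     # weights W_hi(x0); by construction they average to 1
  mean(w * y)         # m_hat(x0) = n^{-1} * sum_i W_hi(x0) * Y_i
}

# Simulated data, for illustration only
set.seed(1)
x <- sort(runif(200))
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

# Evaluate the smoother on a grid of interior points with bandwidth h = 0.1
grid <- seq(0.05, 0.95, by = 0.05)
fit <- vapply(grid, function(g) kernel_smooth(g, x, y, h = 0.1), numeric(1))
```

Calling `plot(x, y); lines(grid, fit, lwd = 2)` overlays the smooth on the scatterplot; re-running with a different `h` shows how the scale parameter changes the size of the local neighborhood.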
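The kernel K is a continuous, bounded and symmetric real function which integrates to 1.

Kernel Smoothing
• The Nadaraya-Watson estimator is defined by
  $\hat{m}_h(x) = \dfrac{n^{-1} \sum_{i=1}^{n} K_h(x - X_i) Y_i}{n^{-1} \sum_{i=1}^{n} K_h(x - X_i)}$  (4)
• Mean squared error: $d_M(x, h) = E[\hat{m}_h(x) - m(x)]^2$. As $n \to \infty$, $h \to 0$, $nh \to \infty$, we have the following result:
  $d_M(x, h) \approx (nh)^{-1} \sigma^2 c_K + h^4 d_K^2 [m''(x)]^2 / 4$  (5)
  where $\sigma^2 = \mathrm{var}(\varepsilon_i)$, $c_K = \int K^2(u)\,du$, $d_K = \int u^2 K(u)\,du$.

Figure 2. The Epanechnikov kernel $K(u) = 0.75(1 - u^2) I(|u| \le 1)$.
Figure 3. The effective kernel weights $K_h(x - \cdot) / \hat{f}_h(x)$ for the food versus net income data set, at x = 1 and x = 2.5 for h = 0.1 (label 1), h = 0.2 (label 2), h = 0.3 (label 3) with the Epanechnikov kernel.

• As h increases, the bias increases while the variance decreases, in line with the two terms of (5).
• In the N-W estimator the choice of kernel has very little effect; the bandwidth has a much larger effect.

Local Regression
• Local regression method: the amount of averaging is controlled by a smoothing parameter. The choice of smoothing parameter is a trade-off between bias and variance.
• Around each local point $x_0$, take a neighborhood with span s = k/n.
• Based on distance, assign weight $K_{i0}$ to points inside the neighborhood; points outside get weight 0.
• Fit by weighted least squares, so that the estimated parameters satisfy
  $\min_{\beta_0, \beta_1} \sum_{i=1}^{n} K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2$
• Combine the local fits at all points into the prediction model.
• With many predictors, consider selecting a subset of them for the local regression.
• Dimension ≤ 3 or 4; in high dimensions the stability of the model is easily limited by the sparsity of the training set.
[Figure: how the fitted pattern changes as the bandwidth varies]

2. Local polynomial regression
Recall the standard nonparametric model:
  $Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \dots, n$  (1)
For x near $x_0$, a Taylor expansion gives
  $m(x) \approx m(x_0) + m'(x_0)(x - x_0) + \dfrac{m''(x_0)}{2!}(x - x_0)^2 + \cdots + \dfrac{m^{(p)}(x_0)}{p!}(x - x_0)^p + O\big((x - x_0)^{p+1}\big)$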
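Fit a local polynomial near the point $x_0$ to be estimated:
  $\min_{\beta} \sum_{t=1}^{n} \Big\{ Y_t - \sum_{j=0}^{p} \beta_j (X_t - x_0)^j \Big\}^2 K_h(X_t - x_0)$
To carry out local polynomial estimation we need to choose the polynomial order p, the bandwidth h and the kernel K. These choices are of course interrelated. When $h \to \infty$, the local polynomial fit becomes a global polynomial fit, and the order p determines the complexity of the model.
The local polynomial problem in matrix form:
  $\min_{\beta} (y - X\beta)^T W (y - X\beta)$
where row t of X is $(1, X_t - x_0, \dots, (X_t - x_0)^p)$ and $W = \mathrm{diag}\{K_h(X_t - x_0)\}$.

Unlike in parametric models, the complexity of a local polynomial fit is controlled by the bandwidth. Usually p is small, so the choice of p matters less. If the goal is to estimate $m^{(\nu)}$, then when $p - \nu$ is odd the local polynomial fit automatically corrects the boundary bias. Moreover, when $p - \nu$ is odd, the order-p fit contains one more parameter than the order-(p-1) fit but does not increase the variance of the estimate of $m^{(\nu)}$. The extra parameter does create an opportunity to reduce bias, especially in boundary regions. On the other hand, the choice of the bandwidth h plays an important role in polynomial fitting. Too large a bandwidth oversmooths and produces excessive modeling bias, while too small a bandwidth undersmooths and yields noisy estimates.
[Figure: local regression results for different window widths]

3. Robust regression: LOWESS (locally weighted scatterplot smoother)
• Basic idea:
  local linear estimation
  smoothing with robust weights (observations with large residuals get smaller weights)
  MAD: MAD = median(|r_i - median(r_i)|)

#Step 1
#Defining the window width
# Assumes TIME and LIBERAL are numeric vectors already in the workspace
# (e.g. from an attached data frame); the data set is not given on the slides.
plot(TIME, LIBERAL, xlab="Time (in days)", ylab="Liberal Support", type='n', main="Defining the Window Width")
ord <- order(TIME)
time <- TIME[ord]                    # x values sorted by time
pre <- LIBERAL[ord]                  # y values in the same order
x0 <- time[10]                       # focal point
diffs <- abs(time - x0)
which.diff <- sort(diffs)[16]        # distance to the 16th nearest point = half-width of the window
abline(v=c(x0-which.diff, x0+which.diff), lty=2)
abline(v=x0)
# index the sorted y values (pre) so that they line up with time and diffs
points(time[diffs > which.diff], pre[diffs > which.diff], pch=16, cex=2, col=gray(.75))
points(time[diffs <= which.diff], pre[diffs <= which.diff], cex=2)
x.n <- time[diffs <= which.diff]     # points inside the window
y.n <- pre[diffs <= which.diff]
text(locator(1), "Window Width")     # interactive: click on the plot to place the label

#Step 2
#Applying the Tricube Weight
#Tricube function
tricube <- function(z) {
  ifelse(abs(z) < 1, (1 - (abs(z))^3)^3, 0)
}
#Bisquare weight
bisquare <- function(z) {
  ifelse(abs(z) < 1, (1 - (abs(z))^2)^2, 0)
}
plot(range(TIME), c(0,1), xlab="Time (in days)", ylab="Tricube Weight", type='n', main="The Tricube Weight")
abline(v=c(x0-which.diff, x0+which.diff), lty=2)
abline(v=x0)
xwts <- seq(x0-which.diff, x0+which.diff, len=250)
lines(xwts, tricube((xwts-x0)/which.diff), lty=1, lwd=2)
points(x.n, tricube((x.n - x0)/which.diff), cex=2)

#Step 3
#The local polynomial
plot(TIME, LIBERAL, xlab="Time (in days)", ylab="Liberal Support", type='n', main="Local Linear Regression")
abline(v=c(x0-which.diff, x0+which.diff), lty=2)
abline(v=x0)
points(x.n, y.n, cex=2)
mod <- lm(y.n ~ x.n, weights=tricube((x.n-x0)/which.diff))
# the slide called reg.line() from the car package; base abline() draws the same fitted line
abline(mod, lwd=2, col=1)
points(x0, predict(mod, data.frame(x.n=x0)), pch=16, cex=1.8)
text(locator(1), "Fitted Value of Y at Focal X")   # interactive: click to place the label

# lowess with different spans f; iter=0 switches off the robustness iterations
library(car) # for data sets
data(Prestige)
attach(Prestige)
plot(income, prestige, xlab="Average Income", ylab="Prestige")
lines(lowess(income, prestige, f=0.5, iter=0), lwd=2)
lines(lowess(income, prestige, f=0.8, iter=0), lwd=2, col=4)
lines(lowess(income, prestige, f=0.1, iter=0), lwd=2, col=6)

[Figure: ordinary local polynomial regression versus robust polynomial regression; how the fit changes in the presence of outliers]

Case study: the relationship between NOx emissions and engine performance
Data: the NOx emissions data set ethanol
Background: interpreting heavy-smog policy: reducing motor vehicle traffic
Existing research
• Engine compression ratio: a high-compression engine running at high temperature can trigger light knocking, which increases NOx emissions.
• Fuel-air equivalence ratio: when the fuel-to-air ratio is below 1 or close to 1, the air is not completely burned, which lowers combustion efficiency and produces more exhaust.
Questions:
• What effect do the two variables actually have on NOx? (Is there a relationship?)
• What is the pattern of the effect? (How is the relationship defined?)
• How are the parameters of the model estimated? (How do the parameters control a stable relationship?)

Variables: NOx = amount of NOx in the exhaust; CompRatio = engine compression ratio; EquivRatio = fuel-air equivalence ratio.

NOx     CompRatio   EquivRatio
3.741   12          0.907
2.295   12          0.761
1.498   12          1.108
2.881   12          1.016
0.76    12          1.189
3.12    9           1.001
0.638   9           1.231
1.17    9           1.123
2.358   12          1.042
0.606   12          1.215

# Generalized cross-validation (GCV) over the nearest-neighbor fraction nn
library(locfit)                      # locfit(), lp(), gcv()
data(ethanol, package="locfit")      # columns NOx, C (compression ratio), E (equivalence ratio)
alpha = seq(0.2, 1, by=0.02)
n1 = length(alpha)
g = matrix(nrow=n1, ncol=4)
for (k in 1:length(alpha)) {
  g[k,] = gcv(NOx~lp(E, nn=alpha[k]), data=ethanol)
}
# column 3 holds the fitted degrees of freedom, column 4 the GCV score
plot(g[,4]~g[,3], ylab="GCV", xlab="degrees of freedom")
f1 = locfit(NOx~lp(E, nn=0.3), data=ethanol)
plot(f1)

# Scatterplots and a local linear model
plot(NOx~C, data=ethanol)
fit = locfit(NOx~lp(E, nn=0.5), data=ethanol)
plot(NOx~E, data=ethanol)            # formula interface, so E and NOx are found in ethanol
lines(fit)

# Bivariate local fit in C and E
fit1 = locfit(NOx~lp(C, E, nn=0.3, scale=0), data=ethanol)
plot(fit1)
[Figure annotations: high-displacement vehicles; incomplete combustion in the engine]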
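The robust weighting idea from section 3 (local linear fits with tricube weights, then bisquare weights that shrink the influence of large residuals) can be sketched as below. This is an illustrative sketch under stated assumptions, not the algorithm used by R's built-in `lowess()`: the helper names `local_linear` and `robust_lowess` are hypothetical, the residual scale is the MAD as defined on the slide, the multiplier 6 is an assumed tuning constant, and the data are simulated.

```r
tricube  <- function(z) ifelse(abs(z) < 1, (1 - abs(z)^3)^3, 0)
bisquare <- function(z) ifelse(abs(z) < 1, (1 - z^2)^2, 0)

# Local linear fit at every x[i], using a window that reaches the k-th
# nearest point, tricube distance weights and optional robustness weights rw
local_linear <- function(x, y, k, rw = rep(1, length(x))) {
  vapply(seq_along(x), function(i) {
    d <- abs(x - x[i])
    h <- sort(d)[k]                 # window half-width
    w <- tricube(d / h) * rw        # distance weight times robustness weight
    xc <- x - x[i]                  # centre at the focal point
    unname(coef(lm(y ~ xc, weights = w))[1])   # intercept = fitted value at x[i]
  }, numeric(1))
}

# Repeat the fit, down-weighting observations with large residuals
robust_lowess <- function(x, y, k, iter = 2) {
  fit <- local_linear(x, y, k)
  for (it in seq_len(iter)) {
    r <- y - fit
    s <- median(abs(r - median(r)))   # MAD, as defined in the lecture
    rw <- bisquare(r / (6 * s))       # multiplier 6 is an assumption of this sketch
    fit <- local_linear(x, y, k, rw)
  }
  fit
}

# Simulated data with one gross outlier, for illustration only
set.seed(2)
x <- seq(0, 10, length.out = 100)
truth <- 2 + 0.5 * x
y <- truth + rnorm(100, sd = 0.2)
y[50] <- y[50] + 10

plain  <- local_linear(x, y, k = 30)
robust <- robust_lowess(x, y, k = 30)
```

Plotting `plain` and `robust` against `x` shows the ordinary local fit being pulled toward the outlier near the middle of the range, while the robust fit stays close to the underlying line.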





