R语言实战:回归

目录

本文内容来自《R 语言实战》(R in Action, 2nd),有部分修改

回归的多面性

  • 简单线性
  • 多项式
  • 多层
  • 多元线性
  • 多变量
  • Logistic
  • 泊松
  • Cox比例风险
  • 时间序列
  • 非线性
  • 非参数
  • 稳健

普通最小二乘 (OLS) 回归法,包括简单线性回归,多项式回归和多元线性回归。

OLS 回归的适用情景

  • 发现有趣的问题
  • 设计一个有用的、可以测量的响应变量
  • 收集合适的数据

基础回顾

OLS 回归

$$ \hat{Y_i}=\hat{\beta_0} + \hat{\beta_1}X_{1i} + … + \hat{\beta_k}X_{ki} $$

残差平方和最小

$$ \sum_{i=1}^{n}(Y_i-\hat{Y_i})^2=\sum_{i=1}^{n}(Y_i-\hat{\beta_0}+\hat{\beta_1}X_{1i}+…+\hat{\beta_k}X_{ki})=\sum_{i=1}^{n}\epsilon_i^2 $$

统计假设

  • 正态性:对于固定的自变量值,因变量值成正态分布
  • 独立性:Y_i 值之间相互独立
  • 线性:因变量与自变量之间为线性相关
  • 同方差性:因变量的方差不随自变量的水平不同而变化

lm() 拟合回归模型

简单线性回归

多项式回归

多元线性回归

简单线性回归

数据集

head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

summary() 函数

fit <- lm(
  weight ~ height, 
  data=women
)
summary(fit)
Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991,	Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

目标值

women$weight
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

fitted() 函数计算预测值

fitted(fit)
       1        2        3        4        5        6        7        8        9       10       11       12 
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833 143.6333 147.0833 150.5333 
      13       14       15 
153.9833 157.4333 160.8833 

residuals() 函数返回残差

residuals(fit)
          1           2           3           4           5           6           7           8           9 
 2.41666667  0.96666667  0.51666667  0.06666667 -0.38333333 -0.83333333 -1.28333333 -1.73333333 -1.18333333 
         10          11          12          13          14          15 
-1.63333333 -1.08333333 -0.53333333  0.01666667  1.56666667  3.11666667

abline() 绘制拟合直线

plot(
  women$height,
  women$weight,
  xlab="Height (in inches)",
  ylab="Weight (in pounds)"
)
abline(fit)

多项式回归

fit2 <- lm(
  weight ~ height + I(height^2),
  data=women
)
summary(fit2)
Call:
lm(formula = weight ~ height + I(height^2), data = women)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50941 -0.29611 -0.00941  0.28615  0.59706 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
height       -7.34832    0.77769  -9.449 6.58e-07 ***
I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared:  0.9995,	Adjusted R-squared:  0.9994 
F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16
plot(
  women$height,
  women$weight,
  xlab="Height (in inches)",
  ylab="Weight (in lbs)"
)
lines(women$height, fitted(fit2))

car 包的 scatterplot() 函数

library(car)
scatterplot(
  weight ~ height,
  data=women,
  smooth=list(lty=2),
  pch=19,
  main="Women Age 30-39",
  xlab="Height (in inches)",
  ylab="Weight (in lbs)"
)

多元线性回归

head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
states <- as.data.frame(
  state.x77[,c(
    "Murder",
    "Population",
    "Illiteracy",
    "Income",
    "Frost"
  )]
)
head(states)
           Murder Population Illiteracy Income Frost
Alabama      15.1       3615        2.1   3624    20
Alaska       11.3        365        1.5   6315   152
Arizona       7.8       2212        1.8   4530    15
Arkansas     10.1       2110        1.9   3378    65
California   10.3      21198        1.1   5114    20
Colorado      6.8       2541        0.7   4884   166

cor() 检查相关性

cor(states)
               Murder Population Illiteracy     Income      Frost
Murder      1.0000000  0.3436428  0.7029752 -0.2300776 -0.5388834
Population  0.3436428  1.0000000  0.1076224  0.2082276 -0.3321525
Illiteracy  0.7029752  0.1076224  1.0000000 -0.4370752 -0.6719470
Income     -0.2300776  0.2082276 -0.4370752  1.0000000  0.2262822
Frost      -0.5388834 -0.3321525 -0.6719470  0.2262822  1.0000000

car 包的 scatterplotMatrix() 函数

scatterplotMatrix(
  states,
  smooth=list(lty=2),
  main="Scatter Plot Matrix"
)

fit_multi <- lm(
  Murder ~ Population + Illiteracy + Income + Frost,
  data=states
)
summary(fit_multi)
Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = states)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
Population  2.237e-04  9.052e-05   2.471   0.0173 *  
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253    
Frost       5.813e-04  1.005e-02   0.058   0.9541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567,	Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08

有交互项的多元线性回归

fit_inter <- lm(
  mpg ~ hp + wt + hp:wt,
  data=mtcars
)
summary(fit_inter)
Call:
lm(formula = mpg ~ hp + wt + hp:wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0632 -1.6491 -0.7362  1.4211  4.5513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
hp          -0.12010    0.02470  -4.863 4.04e-05 ***
wt          -8.21662    1.26971  -6.471 5.20e-07 ***
hp:wt        0.02785    0.00742   3.753 0.000811 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared:  0.8848,	Adjusted R-squared:  0.8724 
F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

effects 包中的 effect() 函数

library(effects)

展示 wt 取 3 个值时,mpg 和 hp 的关系

plot(
  effect(
    "hp:wt",
    fit_inter,
    xlevels=list(
      wt=c(2.2, 3.2, 4.2)
    )
  ),
  multiline=TRUE
)

回归诊断

confint() 函数

confint(fit_multi)
                    2.5 %       97.5 %
(Intercept) -6.552191e+00 9.0213182149
Population   4.136397e-05 0.0004059867
Illiteracy   2.381799e+00 5.9038743192
Income      -1.312611e-03 0.0014414600
Frost       -1.966781e-02 0.0208304170

回归诊断技术

标准方法

fit <- lm(
  weight ~ height,
  data=women
)
summary(fit)
Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991,	Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

plot() 函数生成 4 幅图

plot(fit)

残差与拟合图

Residuals vs Fitted

线性。如果满足,残差值与拟合值没有任何关联

正态 Q-Q 图

Normal Q-Q

正态性。如果满足,残差值应该是均值为 0 的正态分布,沿 45 度角直线分布

位置尺度图

Scale-Location Graph

同方差性。如果满足,水平线周围点应该随机分布

残差与杠杆图

Residuals vs Leverage

鉴别离群点,杠杆点和强影响点

fit2 <- lm(
  weight ~ height + I(height^2),
  data=women
)
summary(fit2)
Call:
lm(formula = weight ~ height + I(height^2), data = women)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50941 -0.29611 -0.00941  0.28615  0.59706 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
height       -7.34832    0.77769  -9.449 6.58e-07 ***
I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared:  0.9995,	Adjusted R-squared:  0.9994 
F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16
plot(fit2)

newfit <- lm(
  weight ~ height + I(height^2),
  data=women[-c(13, 15)]
)
summary(newfit)
Call:
lm(formula = weight ~ height + I(height^2), data = women[-c(13, 
    15)])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50941 -0.29611 -0.00941  0.28615  0.59706 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
height       -7.34832    0.77769  -9.449 6.58e-07 ***
I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared:  0.9995,	Adjusted R-squared:  0.9994 
F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16
fit <- lm(
  Murder ~ Population + Illiteracy + Income + Frost,
  data=states
)
summary(fit)
Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = states)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
Population  2.237e-04  9.052e-05   2.471   0.0173 *  
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253    
Frost       5.813e-04  1.005e-02   0.058   0.9541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567,	Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08
plot(fit)

改进方法

car 包,gvlma

fit <- lm(
  Murder ~ Population + Illiteracy + Income + Frost,
  data=states
)

正态性

car::qqPlot()

qqPlot(
  fit,
  labels=row.names(states),
  id.method="identify",
  simulate=TRUE,
  main="Q-Q Plot"
)
      Nevada Rhode Island 
          28           39 

states["Nevada",]
       Murder Population Illiteracy Income Frost
Nevada   11.5        590        0.5   5149   188
fitted(fit)["Nevada"]
  Nevada 
3.878958 
residuals(fit)["Nevada"]
  Nevada 
7.621042 
rstudent(fit)["Nevada"]
  Nevada 
3.542929

下面函数生成学生化残差的柱状图

residplot <- function(fit, nbreaks=10) {
  z <- rstudent(fit)
  hist(
    z,
    breaks=nbreaks,
    freq=FALSE,
    xlab="Studentized Residual",
    main="Distribution of Error"
  )
  rug(
    jitter(z),
    col="brown"
  )
  curve(
    dnorm(
      x,
      mean=mean(z),
      sd=sd(z)
    ),
    add=TRUE,
    col="blue",
    lwd=2
  )
  lines(
    density(z)$x,
    density(z)$y,
    col="red",
    lwd=2,
    lty=2
  )
  legend(
    "topright",
    legend=c("Normal Curve", "Kernel Density Curve"),
    lty=1:2,
    col=c("blue", "red"),
    cex=.7
  )
}
residplot(fit)

误差的独立性

Durbin-Watson 检验

car 包的 durbinWastsonTest() 函数

durbinWatsonTest(fit)
 lag Autocorrelation D-W Statistic p-value
   1      -0.2006929      2.317691   0.256
 Alternative hypothesis: rho != 0

p 不显著 (p=0.254),说明无自相关性,误差项之间独立

线性

成分残差图 component plus residual plot,偏残差图 partial residual plot

car 包的 crPlots() 函数

crPlots(fit)

如果图形存在非线性,说明建模不够充分

同方差性

car 包的 ncvTest() 函数

零假设为误差方差不变

ncvTest(fit)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 1.746514, Df = 1, p = 0.18632

p 值不显著,说明满足零假设,即满足方差不变假设

car 包的 spreadLevelPlot() 函数

spreadLevelPlot(fit)
Suggested power transformation:  1.209626

线性模型假设的综合验证

gvlma 包中的 gvlma() 函数

library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)
Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = states)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
Population  2.237e-04  9.052e-05   2.471   0.0173 *  
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253    
Frost       5.813e-04  1.005e-02   0.058   0.9541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567,	Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = fit) 

                    Value p-value                Decision
Global Stat        2.7728  0.5965 Assumptions acceptable.
Skewness           1.5374  0.2150 Assumptions acceptable.
Kurtosis           0.6376  0.4246 Assumptions acceptable.
Link Function      0.1154  0.7341 Assumptions acceptable.
Heteroscedasticity 0.4824  0.4873 Assumptions acceptable.

多重共线性

VIF,Variance Inflation Factor,方差膨胀因子

car 包的 vif() 函数

vif(fit)
Population Illiteracy     Income      Frost 
  1.245282   2.165848   1.345822   2.082547 

一般原则,\sqrt{vif} > 2 表明存在多重共线性问题

sqrt(vif(fit)) > 2
Population Illiteracy     Income      Frost 
     FALSE      FALSE      FALSE      FALSE 

异常观测值

离群点

粗糙判断的两种方法:

  • Q-Q 图中落在置信区间带外的点
  • 标准化残差值大于 2 或者小于 -2

car 包的 outlierTest() 函数

outlierTest(fit)
       rstudent unadjusted p-value Bonferroni p
Nevada 3.542929         0.00095088     0.047544

高杠杆点

帽子统计量,hat statistic

帽子值大于帽子均值的 2 或 3 倍,帽子均值为 p/n

hat.plot <- function(fit) {
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  plot(
    hatvalues(fit),
    main="Index Plot of Hat Values"
  )
  abline(
    h=c(2, 3)*p/n,
    col="red",
    lty=2
  )
}
hat.plot(fit)

强影响点

两种检测方法:

  • Cook 距离,D 统计量,大于 4/(n-k-1)
  • 变量添加图,added variable plot
cutoff <- 4 / (
  nrow(states) - length(fit$coefficients) - 2
)
plot(
  fit,
  which=4,
  cook.levels=cutoff
)
abline(
  h=cutoff,
  lty=2,
  col="red"
)

car 包的 avPlot() 函数

avPlots(
  fit,
  ask=FALSE
)

car 包的 influencePlot() 函数

influencePlot(
  fit,
  main="Influence Plot",
  sub="Circle size if proportional to Cook's distance"
)
                StudRes        Hat       CookD
Alaska        1.7536917 0.43247319 0.448050997
California   -0.2761492 0.34087628 0.008052956
Nevada        3.5429286 0.09508977 0.209915743
Rhode Island -2.0001631 0.04562377 0.035858963

纵坐标大于 +2 或小于 -2 的点是离群点。

横坐标大于 0.2 或 0.3 的点是高杠杆点。

圆圈很大的点是强影响点。

改进措施

删除观测点

谨慎删除离群点和强影响点

变量变换

car 包的 powerTransform() 函数通过 lambda 的最大似然估计来正态化变量 X^{lambda}

summary(
  powerTransform(states$Murder)
)
bcPower Transformation to Normality 
              Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
states$Murder    0.6055           1       0.0884       1.1227

Likelihood ratio test that transformation parameter is equal to 0
 (log transformation)
                           LRT df     pval
LR test, lambda = (0) 5.665991  1 0.017297

Likelihood ratio test that no transformation is needed
                           LRT df    pval
LR test, lambda = (1) 2.122763  1 0.14512

car 包的 boxTidwell() 函数最大似然估计预测变量幂数

boxTidwell(
  Murder ~ Population + Illiteracy,
  data=states
)
           MLE of lambda Score Statistic (z) Pr(>|z|)
Population       0.86939             -0.3228   0.7468
Illiteracy       1.35812              0.6194   0.5357

iterations =  19 

删增变量

删除变量方法举例:

  • 删除某个多重共线性的变量
  • 岭回归

尝试其他方法

存在离群点和/或强影响点:稳健回归模型

违背正态性假设:非参数回归模型

显著的非线性:非线性回归模型

违背误差独立性假设:时间序列模型,多层次回归模型。。。

广义线性模型

选择“最佳”的回归模型

模型比较

anova() 函数比较两个嵌套模型的拟合优度

fit1 <- lm(
  Murder ~ Population + Illiteracy + Income + Frost,
  data=states
)
fit2 <- lm(
  Murder ~ Population + Illiteracy,
  data=states
)

anova(fit2, fit1)
Analysis of Variance Table

Model 1: Murder ~ Population + Illiteracy
Model 2: Murder ~ Population + Illiteracy + Income + Frost
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 289.25                           
2     45 289.17  2  0.078505 0.0061 0.9939

赤池信息准则(AIC,Akaike Information Criterion)

AIC 较小的模型优先选择

fit1 <- lm(
  Murder ~ Population + Illiteracy + Income + Frost,
  data=states
)
fit2 <- lm(
  Murder ~ Population + Illiteracy,
  data=states
)
AIC(fit1, fit2)
     df      AIC
fit1  6 241.6429
fit2  4 237.6565

变量选择

逐步回归法

stepwise method

向前逐步回归 forward stepwise regression

向后逐步回归 backward stepwise regression

向前向后逐步回归,逐步回归,stepwise stepwise regression

MASS 包中的 stepAIC() 函数

library(MASS)
fit <- lm(
  Murder ~ Population + Illiteracy + Income + Frost,
  data=states
)
stepAIC(fit, direction="backward")
Start:  AIC=97.75
Murder ~ Population + Illiteracy + Income + Frost

             Df Sum of Sq    RSS     AIC
- Frost       1     0.021 289.19  95.753
- Income      1     0.057 289.22  95.759
<none>                    289.17  97.749
- Population  1    39.238 328.41 102.111
- Illiteracy  1   144.264 433.43 115.986

Step:  AIC=95.75
Murder ~ Population + Illiteracy + Income

             Df Sum of Sq    RSS     AIC
- Income      1     0.057 289.25  93.763
<none>                    289.19  95.753
- Population  1    43.658 332.85 100.783
- Illiteracy  1   236.196 525.38 123.605

Step:  AIC=93.76
Murder ~ Population + Illiteracy

             Df Sum of Sq    RSS     AIC
<none>                    289.25  93.763
- Population  1    48.517 337.76  99.516
- Illiteracy  1   299.646 588.89 127.311

Call:
lm(formula = Murder ~ Population + Illiteracy, data = states)

Coefficients:
(Intercept)   Population   Illiteracy  
  1.6515497    0.0002242    4.0807366 

全子集回归

leaps 包的 regsubsets() 函数

R 平方

调整 R 平方

Mallows Cp 统计量

library(leaps)
leaps <- regsubsets(
  Murder ~ Population + Illiteracy + Income + Frost,
  data=states,
  nbest=4
)
summary(leaps)
Subset selection object
Call: regsubsets.formula(Murder ~ Population + Illiteracy + Income + 
    Frost, data = states, nbest = 4)
4 Variables  (and intercept)
           Forced in Forced out
Population     FALSE      FALSE
Illiteracy     FALSE      FALSE
Income         FALSE      FALSE
Frost          FALSE      FALSE
4 subsets of each size up to 4
Selection Algorithm: exhaustive
         Population Illiteracy Income Frost
1  ( 1 ) " "        "*"        " "    " "  
1  ( 2 ) " "        " "        " "    "*"  
1  ( 3 ) "*"        " "        " "    " "  
1  ( 4 ) " "        " "        "*"    " "  
2  ( 1 ) "*"        "*"        " "    " "  
2  ( 2 ) " "        "*"        " "    "*"  
2  ( 3 ) " "        "*"        "*"    " "  
2  ( 4 ) "*"        " "        " "    "*"  
3  ( 1 ) "*"        "*"        "*"    " "  
3  ( 2 ) "*"        "*"        " "    "*"  
3  ( 3 ) " "        "*"        "*"    "*"  
3  ( 4 ) "*"        " "        "*"    "*"  
4  ( 1 ) "*"        "*"        "*"    "*"  

每一行表示一个组合,变量是色块对应的横坐标

plot(leaps, scale="adjr2")

car 包的 subsets() 函数

越好的模型越接近直线

subsets(
  leaps,
  statistic="cp",
  main="Cp Plot for All Subsets Regression",
  legend="topright"
)
abline(
  1, 1,
  lty=2,
  col="red"
)

深层次分析

交叉验证

模型泛化能力

k 重交叉验证

bootstrap 包的 crossval() 函数

shrinkage <- function(fit, k=10) {
  require(bootstrap)
  theta.fit <- function(x, y)(lsfit(x, y))
  theta.predict <- function(fit, x){
    cbind(1, x)%*%fit$coefficients
  }
  
  x <- fit$model[, 2:ncol(fit$model)]
  y <- fit$model[, 1]
  
  results <- crossval(
    x, y,
    theta.fit,
    theta.predict,
    ngroup=k
  )
  
  r2 <- cor(
    y,
    fit$fitted.values
  )^2
  
  r2cv <- cor(
    y,
    results$cv.fit
  )^2
  
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square = ", r2cv, "\n")
  cat("Change =", r2 - r2cv, "\n")
}
fit <- lm(
  Murder ~ Population + Income + Illiteracy + Frost,
  data=states
)
shrinkage(fit)
Original R-square = 0.5669502 
10 Fold Cross-Validated R-square =  0.4707088 
Change = 0.09624146 
fit2 <- lm(
  Murder ~ Population + Illiteracy,
  data=states
)
shrinkage(fit2)
Original R-square = 0.5668327 
10 Fold Cross-Validated R-square =  0.533334 
Change = 0.03349863 

相对重要性

归一化

zstates <- as.data.frame(scale(states))
zfit <- lm(
  Murder ~ Population + Income + Illiteracy + Frost,
  data=zstates
)
coef(zfit)
  (Intercept)    Population        Income    Illiteracy         Frost 
-2.054026e-16  2.705095e-01  1.072372e-02  6.840496e-01  8.185407e-03

相对权重,relative weight,对所有可能子模型添加一个预测变量引起的 R 平方平均增加量的一个近似值

relweights <- function(fit, ...) {
  R <- cor(fit$model)
  nvar <- ncol(R)
  
  rxx <- R[2:nvar, 2:nvar]
  rxy <- R[2:nvar, 1]
  
  svd <- eigen(rxx)
  evec <- svd$vectors
  ev <- svd$values
  delta <- diag(sqrt(ev))
  lambda <- evec %*% delta %*% t(evec)
  lambdasq <- lambda^2
  
  beta <- solve(lambda) %*% rxy
  rsquare <- colSums(beta ^ 2)
  rawwgt <- lambdasq %*% beta ^ 2
  
  import <- (rawwgt / rsquare) * 100
  import <- as.data.frame(import)
  row.names(import) <- names(fit$model[2:nvar])
  names(import) <- "Weights"
  import <- import[order(import), 1, drop=FALSE]
  dotchart(
    import$Weights,
    labels=row.names(import),
    xlab="% of R-Square",
    pch=19,
    main="Relative Importance of Predictor Variables",
    sub=paste("Total R-Square=", round(rsquare, digits=3)),
    ...
  )
  return(import)
}
fit <- lm(
  Murder ~ Population + Illiteracy + Income + Frost,
  data=states
)
relweights(fit, col="blue")
             Weights
Income      5.488962
Population 14.723401
Frost      20.787442
Illiteracy 59.000195

参考

https://github.com/perillaroc/r-in-action-study

R 语言实战

图形初阶

基本数据管理

高级数据管理

基本图形

基本统计分析