ISLR Exercises: Linear Regression - The Collinearity Problem

This post works through an exercise from Chapter 3 of An Introduction to Statistical Learning: with Applications in R (ISLR).

Creating a data set with collinear predictors

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 + x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

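The data follow the linear model

$$ y = 2 + 2x_1 + 0.3x_2 + \epsilon $$

where the noise term epsilon is standard normal, so the true coefficients are beta_0 = 2, beta_1 = 2 and beta_2 = 0.3.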

Correlation between x1 and x2

cor(x1, x2)
[1] 0.9469723
plot(x1, x2)

Fitting y on both predictors

lm_fit_v1 <- lm(y ~ x1 + x2)
summary(lm_fit_v1)
Call:
lm(formula = y ~ x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8311 -0.7273 -0.0537  0.6338  2.3359 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.7757     0.5933   2.993  0.00351 **
x1            1.0847     1.2346   0.879  0.38179   
x2            1.0097     1.1337   0.891  0.37536   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.056 on 97 degrees of freedom
Multiple R-squared:  0.2333,	Adjusted R-squared:  0.2175 
F-statistic: 14.76 on 2 and 97 DF,  p-value: 2.54e-06

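The estimated beta_1 and beta_2 (1.08 and 1.01) are far from the true values of 2 and 0.3.

The p-values for the x1 and x2 coefficients are large (about 0.38 for both), so neither null hypothesis, beta_1 = 0 or beta_2 = 0, can be rejected.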
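The large standard errors are what collinearity predicts. One common way to quantify this is the variance inflation factor (VIF), which with only two predictors reduces to 1 / (1 - r^2), where r is the correlation between x1 and x2. The following is a minimal sketch in base R, assuming the data are regenerated with the same seed as above; the names r2 and vif are introduced here for illustration only.

```r
# Regenerate the predictors exactly as in the post
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 + x1 + rnorm(100) / 10

# R-squared from regressing one predictor on the other
r2 <- summary(lm(x1 ~ x2))$r.squared

# Variance inflation factor: how much the variance of a coefficient
# estimate grows compared with the case of uncorrelated predictors
vif <- 1 / (1 - r2)
print(vif)
```

A VIF well above 5 or 10 is usually read as a sign of problematic collinearity; the square root of the VIF is the factor by which the standard error is inflated.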

Single-predictor fits

y on x1

lm_fit_x1 <- lm(y ~ x1)
summary(lm_fit_x1)
Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.89495 -0.66874 -0.07785  0.59221  2.45560 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.2624     0.2307   9.805 3.21e-16 ***
x1            2.1259     0.3963   5.365 5.42e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.055 on 98 degrees of freedom
Multiple R-squared:  0.227,	Adjusted R-squared:  0.2191 
F-statistic: 28.78 on 1 and 98 DF,  p-value: 5.42e-07

The p-value is essentially zero (5.42e-07), so the null hypothesis that the x1 coefficient is zero can be rejected.

y on x2

lm_fit_x2 <- lm(y ~ x2)
summary(lm_fit_x2)
Call:
lm(formula = y ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.74970 -0.68815 -0.03074  0.66090  2.34837 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.3789     0.3845   3.587 0.000525 ***
x2            1.9529     0.3639   5.367 5.36e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.055 on 98 degrees of freedom
Multiple R-squared:  0.2272,	Adjusted R-squared:  0.2193 
F-statistic: 28.81 on 1 and 98 DF,  p-value: 5.361e-07

The p-value is essentially zero (5.36e-07), so the null hypothesis that the x2 coefficient is zero can be rejected.
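The single-predictor results do not contradict the joint fit: each predictor is clearly related to y on its own, but because x1 and x2 carry almost the same information, neither adds much once the other is already in the model. A nested-model F-test makes this concrete. The sketch below assumes the data are regenerated with the same seed; fit_x1, fit_both, p_partial_f and p_t are illustrative names, not objects from the post.

```r
# Regenerate the data exactly as in the post
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 + x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

fit_x1 <- lm(y ~ x1)          # smaller model
fit_both <- lm(y ~ x1 + x2)   # adds x2 on top of x1

# Partial F-test: does x2 improve the fit once x1 is included?
cmp <- anova(fit_x1, fit_both)
p_partial_f <- cmp[["Pr(>F)"]][2]

# For a single added term this equals the t-test p-value of x2
# in the full model
p_t <- summary(fit_both)$coefficients["x2", "Pr(>|t|)"]

print(c(partial_F = p_partial_f, t_test = p_t))
```

Swapping the roles of x1 and x2 gives the analogous test for x1 given x2.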

A mismeasured observation

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

Multiple predictors

y on x1 and x2

lm_fit_v2 <- lm(y ~ x1 + x2)
summary(lm_fit_v2)
Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.77906 -0.72031 -0.05796  0.62800  3.04112 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.5360     0.6115   2.512   0.0136 *
x1            0.1292     1.2407   0.104   0.9173  
x2            1.7624     1.1500   1.532   0.1286  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.099 on 98 degrees of freedom
Multiple R-squared:  0.201,	Adjusted R-squared:  0.1847 
F-statistic: 12.32 on 2 and 98 DF,  p-value: 1.682e-05
plot(lm_fit_v2)

Judging from the diagnostic plots, the new observation is both an outlier and a high-leverage point in this model.

Single predictor

y on x1

lm_fit_x1_v2 <- lm(y ~ x1)
summary(lm_fit_x1_v2)
Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8899 -0.6553 -0.0917  0.5679  3.4070 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.4005     0.2378   10.09  < 2e-16 ***
x1            1.9251     0.4104    4.69 8.74e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.106 on 99 degrees of freedom
Multiple R-squared:  0.1818,	Adjusted R-squared:  0.1736 
F-statistic:    22 on 1 and 99 DF,  p-value: 8.744e-06
plot(lm_fit_x1_v2)

Judging from the diagnostic plots, the new observation is both an outlier and a high-leverage point in this model.

y on x2

lm_fit_x2_v2 <- lm(y ~ x2)
summary(lm_fit_x2_v2)
Call:
lm(formula = y ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.76917 -0.70920 -0.04555  0.64028  3.01186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.4877     0.3964   3.753 0.000295 ***
x2            1.8755     0.3760   4.989  2.6e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.093 on 99 degrees of freedom
Multiple R-squared:  0.2009,	Adjusted R-squared:  0.1928 
F-statistic: 24.89 on 1 and 99 DF,  p-value: 2.601e-06
plot(lm_fit_x2_v2)

Judging from the diagnostic plots, the new observation is an outlier but not a high-leverage point in this model.
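The readings from the diagnostic plots can be checked numerically. The sketch below, which assumes the data are regenerated with the same seed, collects the leverage (hat value) and the studentized residual of the added observation for all three models. The names fits and diagnostics are illustrative. The cutoffs mentioned in the comments are common rules of thumb rather than fixed definitions.

```r
# Regenerate the data and append the mismeasured observation
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 + x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

fits <- list(
  both = lm(y ~ x1 + x2),
  x1_only = lm(y ~ x1),
  x2_only = lm(y ~ x2)
)

n <- length(y)  # the added observation is the last one (index 101)

diagnostics <- t(sapply(fits, function(fit) {
  p <- length(coef(fit))  # number of coefficients, intercept included
  c(
    # Leverage of the new point; values well above the mean
    # leverage p / n suggest a high-leverage point
    leverage = unname(hatvalues(fit)[n]),
    mean_leverage = p / n,
    # Studentized residual; an absolute value around 3 or more
    # suggests an outlier
    rstudent = unname(rstudent(fit)[n])
  )
}))

print(round(diagnostics, 3))
```

Comparing the leverage column with mean_leverage shows, model by model, how unusual the new point is in predictor space, while the rstudent column shows how unusual its response is.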

Reference

https://github.com/perillaroc/islr-study
