ISLR Lab: Classification - Linear Discriminant Analysis

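This article is based on Section 4.6, "Lab: Logistic Regression, LDA, QDA, and KNN", of An Introduction to Statistical Learning with Applications in R (ISLR).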

library(ISLR)
library(MASS)
library(pROC)

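Data

Stock market data: the Smarket data set from the ISLR package, which holds daily percentage returns of the S&P 500 index from 2001 to 2005.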

data(Smarket)
head(Smarket)
  Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up

Training and test sets

Training set: 2001 to 2004

Test set: 2005

train <- (Smarket$Year < 2005)

train is a logical (Boolean) vector: TRUE for observations before 2005 and FALSE for observations in 2005.

smarket_2005 <- Smarket[!train, ]
dim(smarket_2005)
[1] 252   9
direction_2005 <- Smarket$Direction[!train]
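
As a quick sanity check on the split, the sketch below counts the observations on each side. It indexes the data frame explicitly (Smarket$Year) rather than relying on attach(); the names n_train and n_test are illustrative and do not appear in the original.

```r
library(ISLR)  # provides the Smarket data set

# Logical mask: TRUE for 2001-2004 (training), FALSE for 2005 (test)
train <- Smarket$Year < 2005

n_train <- sum(train)    # number of training observations
n_test  <- sum(!train)   # number of test observations (252, matching dim() above)

c(train = n_train, test = n_test)
```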

Method

The lda() function in the MASS package fits a linear discriminant analysis model.

lda_fit <- lda(
  Direction ~ Lag1 + Lag2,
  data=Smarket,
  subset=train
)
lda_fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
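
The first two blocks of this output can be reproduced directly from the training data. When no prior is given, lda() uses the class proportions of the training set, and the group means are the per-class averages of each predictor. A minimal sketch (the names training, priors and group_means are illustrative):

```r
library(ISLR)

# Training observations: 2001 to 2004
training <- Smarket[Smarket$Year < 2005, ]

# Class proportions in the training set: the default priors used by lda()
priors <- prop.table(table(training$Direction))

# Per-class means of the two predictors
group_means <- aggregate(cbind(Lag1, Lag2) ~ Direction, data = training, FUN = mean)

priors
group_means
```

The group means suggest that the previous two days' returns tend to be slightly negative before Up days and slightly positive before Down days.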

The plot() function plots the linear discriminant values (LD1) separately for each class.

plot(lda_fit)

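Prediction

predict() returns a list with three elements:

  • class: the predicted class for each observation
  • posterior: the matrix of posterior probabilities, one column per class
  • x: the linear discriminant values
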
lda_predict <- predict(
  lda_fit,
  smarket_2005
)
names(lda_predict)
[1] "class"     "posterior" "x"     

Posterior probabilities

lda_predict$posterior[1:20, ]
          Down        Up
999  0.4901792 0.5098208
1000 0.4792185 0.5207815
1001 0.4668185 0.5331815
1002 0.4740011 0.5259989
1003 0.4927877 0.5072123
1004 0.4938562 0.5061438
1005 0.4951016 0.5048984
1006 0.4872861 0.5127139
1007 0.4907013 0.5092987
1008 0.4844026 0.5155974
1009 0.4906963 0.5093037
1010 0.5119988 0.4880012
1011 0.4895152 0.5104848
1012 0.4706761 0.5293239
1013 0.4744593 0.5255407
1014 0.4799583 0.5200417
1015 0.4935775 0.5064225
1016 0.5030894 0.4969106
1017 0.4978806 0.5021194
1018 0.4886331 0.5113669

Predicted classes

lda_predict$class[1:20]
 [1] Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Down Up  
[14] Up   Up   Up   Up   Down Up   Up  
Levels: Down Up

Linear discriminants

lda_predict$x[1:20]
 [1]  0.08293096  0.59114102  1.16723063  0.83335022 -0.03792892
 [6] -0.08743142 -0.14512719  0.21701324  0.05873792  0.35068642
[11]  0.05897298 -0.92794134  0.11370190  0.98783874  0.81206862
[16]  0.55681363 -0.07452314 -0.51514029 -0.27386231  0.15458312
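
These values are a linear combination of Lag1 and Lag2. The sketch below recomputes them by hand, assuming that MASS's predict.lda centers each observation at the prior-weighted average of the group means and then multiplies by the coefficients in lda_fit$scaling. The names center, predictors and x_manual are illustrative.

```r
library(ISLR)
library(MASS)

train <- Smarket$Year < 2005
lda_fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
smarket_2005 <- Smarket[!train, ]
lda_predict <- predict(lda_fit, smarket_2005)

# Centering point: group means weighted by the prior probabilities
center <- colSums(lda_fit$prior * lda_fit$means)

# Center the predictors, then project onto the LD1 coefficients
predictors <- as.matrix(smarket_2005[, c("Lag1", "Lag2")])
x_manual <- scale(predictors, center = center, scale = FALSE) %*% lda_fit$scaling

# Compare with the values returned by predict()
all.equal(as.numeric(x_manual), as.numeric(lda_predict$x))
```

In the output above, larger values of x go with a higher posterior probability of Up.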

Confusion matrix

lda_class <- lda_predict$class
table(direction_2005, lda_class)
              lda_class
direction_2005 Down  Up
          Down   35  76
          Up     35 106
mean(lda_class == direction_2005)
[1] 0.5595238
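
Accuracy alone hides how the errors are distributed. The sketch below rebuilds the table from the counts printed above and derives a few per-class rates, treating Up as the positive class (a choice made here only for illustration). The object and metric names are not from the original.

```r
# Confusion matrix from the counts above: rows = true direction, columns = prediction
confusion <- matrix(
  c(35, 35, 76, 106),
  nrow = 2,
  dimnames = list(truth = c("Down", "Up"), predicted = c("Down", "Up"))
)

accuracy    <- sum(diag(confusion)) / sum(confusion)
sensitivity <- confusion["Up", "Up"] / sum(confusion["Up", ])        # true Up days predicted Up
specificity <- confusion["Down", "Down"] / sum(confusion["Down", ])  # true Down days predicted Down
precision   <- confusion["Up", "Up"] / sum(confusion[, "Up"])        # predicted Up days that were Up

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The model predicts Up on 182 of the 252 test days, so it identifies about 75% of the Up days but only about 32% of the Down days.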

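class applies a 50% threshold to the posterior probabilities: a day is predicted Down when its posterior probability of Down (column 1) exceeds 0.5.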

sum(lda_predict$posterior[,1] > .5)
[1] 70
sum(lda_predict$posterior[,1] <= .5)
[1] 182
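
The 50% rule can be applied by hand to confirm that it reproduces class. A minimal sketch; manual_class is an illustrative name.

```r
library(ISLR)
library(MASS)

train <- Smarket$Year < 2005
lda_fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda_predict <- predict(lda_fit, Smarket[!train, ])

# Predict Down when P(Down) > 0.5, otherwise Up
manual_class <- ifelse(lda_predict$posterior[, "Down"] > 0.5, "Down", "Up")

# Agreement with the class labels returned by predict()
all(manual_class == as.character(lda_predict$class))
table(manual_class)
```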

With a 90% threshold on the posterior probability of Down, no day in 2005 qualifies:

sum(lda_predict$posterior[,1] > .9)
[1] 0
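
To see why no day clears 90%, it helps to count the days above several thresholds and to look at the largest posterior probability of Down. The thresholds below are chosen only for illustration, and p_down and days_above are illustrative names.

```r
library(ISLR)
library(MASS)

train <- Smarket$Year < 2005
lda_fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda_predict <- predict(lda_fit, Smarket[!train, ])

# Posterior probability of a market decrease on each 2005 day
p_down <- lda_predict$posterior[, "Down"]

# Number of days whose P(Down) exceeds each threshold
thresholds <- c(0.5, 0.51, 0.52, 0.9)
days_above <- sapply(thresholds, function(t) sum(p_down > t))
names(days_above) <- thresholds

days_above
max(p_down)  # largest posterior probability of Down in 2005
```

The posterior probabilities listed earlier all sit close to 0.5, so a threshold well above 0.5 selects few days or none.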

ROC curve

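# Build the ROC object from the true 2005 directions and the posterior probability of Up
roc_2005 <- roc(
  direction_2005,
  lda_predict$posterior[, 2],
  percent=TRUE
)
# Draw the curve and print the AUC on the plot
plot(roc_2005, print.auc=TRUE)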
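
The AUC printed on the plot can be cross-checked without pROC through its rank (Mann-Whitney) form: the probability that a randomly chosen Up day receives a higher score than a randomly chosen Down day. This is a sketch that assumes Up is the positive class and that a higher P(Up) means "more likely Up". pROC picks the comparison direction automatically, so its figure may differ if it chose the opposite direction. The names score, is_up, ranks and auc are illustrative.

```r
library(ISLR)
library(MASS)

train <- Smarket$Year < 2005
lda_fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda_predict <- predict(lda_fit, Smarket[!train, ])

score <- lda_predict$posterior[, "Up"]       # classifier score: P(Up)
is_up <- Smarket$Direction[!train] == "Up"   # true labels for 2005

n_up   <- sum(is_up)
n_down <- sum(!is_up)

# Rank-sum (Mann-Whitney) form of the AUC; rank() averages tied scores
ranks <- rank(score)
auc <- (sum(ranks[is_up]) - n_up * (n_up + 1) / 2) / (n_up * n_down)

auc  # on a 0-1 scale; multiply by 100 to compare with percent=TRUE
```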

References

https://github.com/perillaroc/islr-study
