R语言实战:分类

目录

本文内容来自《R 语言实战》(R in Action, 2nd),有部分修改

介绍有监督机器学习领域中四种可用于分类的方法:

  • 逻辑回归
  • 决策树
  • 随机森林
  • 支持向量机
library(glue)

数据

威斯康星州乳腺癌数据集

loc <- "https://archive.ics.uci.edu/ml/machine-learning-databases/"
ds <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- glue("{loc}{ds}")

加载网络数据,并设置列名。 数据中的缺失数据用 ? 表示。

breast <- read.table(
  url,
  sep=",",
  header=FALSE,
  na.strings="?"
)
names(breast) <- c(
  "ID", 
  "clumpThickness",
  "sizeUniformity",
  "shapeUniformity",
  "maginalAdhesion",
  "singleEpithelialCellSize",
  "bareNuclei",
  "blankChromatin",
  "normalNucleoli",
  "mitosis",
  "class"
)
head(breast)
       ID clumpThickness sizeUniformity shapeUniformity maginalAdhesion
1 1000025              5              1               1               1
2 1002945              5              4               4               5
3 1015425              3              1               1               1
4 1016277              6              8               8               1
5 1017023              4              1               1               3
6 1017122              8             10              10               8
  singleEpithelialCellSize bareNuclei blankChromatin normalNucleoli
1                        2          1              3              1
2                        7         10              3              2
3                        2          2              3              1
4                        3          4              3              7
5                        2          1              3              1
6                        7         10              9              7
  mitosis class
1       1     2
2       1     2
3       1     2
4       1     2
5       1     2
6       1     4

替换 class

df <- breast[-1]
df$class <- factor(df$class, levels=c(2, 4), labels=c("benign", "malignant"))
head(df)
  clumpThickness sizeUniformity shapeUniformity maginalAdhesion
1              5              1               1               1
2              5              4               4               5
3              3              1               1               1
4              6              8               8               1
5              4              1               1               3
6              8             10              10               8
  singleEpithelialCellSize bareNuclei blankChromatin normalNucleoli
1                        2          1              3              1
2                        7         10              3              2
3                        2          2              3              1
4                        3          4              3              7
5                        2          1              3              1
6                        7         10              9              7
  mitosis     class
1       1    benign
2       1    benign
3       1    benign
4       1    benign
5       1    benign
6       1 malignant

分割训练集和验证集

set.seed(1234)
train <- sample(nrow(df), 0.7*nrow(df))
df_train <- df[train,]
df_validate <- df[-train,]
table(df_train$class)
   benign malignant 
      319       170 
table(df_validate$class)
   benign malignant 
      139        71 

逻辑回归

fit_logit <- glm(
  class~.,
  data=df_train,
  family=binomial()
)
summary(fit_logit)
lm(formula = class ~ ., family = binomial(), data = df_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.24605  -0.08012  -0.03110   0.00266   2.11576  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -12.4412     2.0547  -6.055  1.4e-09 ***
clumpThickness             0.7407     0.2262   3.275  0.00106 ** 
sizeUniformity            -0.0320     0.3399  -0.094  0.92500    
shapeUniformity            0.2073     0.3715   0.558  0.57680    
maginalAdhesion            0.5194     0.1708   3.041  0.00236 ** 
singleEpithelialCellSize  -0.3217     0.2613  -1.231  0.21831    
bareNuclei                 0.5851     0.1881   3.111  0.00187 ** 
blankChromatin             0.8599     0.2923   2.942  0.00326 ** 
normalNucleoli             0.4036     0.1828   2.208  0.02725 *  
mitosis                    0.8923     0.3552   2.512  0.01200 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 621.04  on 477  degrees of freedom
Residual deviance:  52.39  on 468  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 72.39

Number of Fisher Scoring iterations: 9

预测验证集,将概率转换为因子变量

prob <- predict(fit_logit, df_validate, type="response")
logit_predict <- factor(
  prob > .5,
  levels=c(FALSE, TRUE),
  labels=c("benign", "malignant")
)

计算列联表 (混淆矩阵)

logit_perfect <- table(
  df_validate$class,
  logit_predict,
  dnn=c("Actual", "Predicted")
)
logit_perfect
           Predicted
Actual      benign malignant
  benign       129         6
  malignant      1        69

逐步逻辑回归

生成一个包含更少解释变量的模型

fit_logit_reduced <- step(fit_logit)
class ~ clumpThickness + sizeUniformity + shapeUniformity + maginalAdhesion + 
    singleEpithelialCellSize + bareNuclei + blankChromatin + 
    normalNucleoli + mitosis

                           Df Deviance    AIC
- sizeUniformity            1   52.399 70.399
- shapeUniformity           1   52.686 70.686
- singleEpithelialCellSize  1   53.910 71.910
<none>                          52.390 72.390
- normalNucleoli            1   57.465 75.465
- mitosis                   1   57.966 75.966
- blankChromatin            1   62.856 80.856
- maginalAdhesion           1   63.044 81.044
- bareNuclei                1   64.982 82.982
- clumpThickness            1   68.323 86.323

Step:  AIC=70.4
class ~ clumpThickness + shapeUniformity + maginalAdhesion + 
    singleEpithelialCellSize + bareNuclei + blankChromatin + 
    normalNucleoli + mitosis

                           Df Deviance    AIC
- shapeUniformity           1   52.876 68.876
- singleEpithelialCellSize  1   53.918 69.918
<none>                          52.399 70.399
- normalNucleoli            1   57.916 73.916
- mitosis                   1   58.024 74.024
- blankChromatin            1   63.272 79.272
- maginalAdhesion           1   63.735 79.735
- bareNuclei                1   64.985 80.985
- clumpThickness            1   68.728 84.728

Step:  AIC=68.88
class ~ clumpThickness + maginalAdhesion + singleEpithelialCellSize + 
    bareNuclei + blankChromatin + normalNucleoli + mitosis

                           Df Deviance    AIC
- singleEpithelialCellSize  1   54.154 68.154
<none>                          52.876 68.876
- mitosis                   1   59.402 73.402
- normalNucleoli            1   60.929 74.929
- blankChromatin            1   64.053 78.053
- maginalAdhesion           1   64.995 78.995
- bareNuclei                1   75.634 89.634
- clumpThickness            1   76.942 90.942

Step:  AIC=68.15
class ~ clumpThickness + maginalAdhesion + bareNuclei + blankChromatin + 
    normalNucleoli + mitosis

                  Df Deviance    AIC
<none>                 54.154 68.154
- mitosis          1   60.177 72.177
- normalNucleoli   1   60.937 72.937
- blankChromatin   1   64.056 76.056
- maginalAdhesion  1   65.022 77.022
- bareNuclei       1   76.417 88.417
- clumpThickness   1   77.027 89.027

预测验证集

prob <- predict(fit_logit_reduced, df_validate, type="response")
logit_predict <- factor(
  prob > .5,
  levels=c(FALSE, TRUE),
  labels=c("benign", "malignant")
)

计算列联表 (混淆矩阵)

logit_perfect <- table(
  df_validate$class,
  logit_predict,
  dnn=c("Actual", "Predicted")
)
logit_perfect
           Predicted
Actual      benign malignant
  benign       130         5
  malignant      1        69

相比上述结果,精简的模型正确分类的样本比全要素模型多 1 个。

决策树

经典决策树

library(rpart)
library(rpart.plot)

使用 rpart 包中的 rpart() 函数构造决策树

set.seed(1234)
dtree <- rpart(
  class~.,
  data=df_train,
  method="class",
  parms=list(split="information")
)
print(dtree)
node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 489 170 benign (0.65235174 0.34764826)  
   2) sizeUniformity< 2.5 304   8 benign (0.97368421 0.02631579)  
     4) clumpThickness< 6.5 297   3 benign (0.98989899 0.01010101) *
     5) clumpThickness>=6.5 7   2 malignant (0.28571429 0.71428571) *
   3) sizeUniformity>=2.5 185  23 malignant (0.12432432 0.87567568)  
     6) bareNuclei< 2.5 28  13 benign (0.53571429 0.46428571)  
      12) sizeUniformity< 3.5 16   1 benign (0.93750000 0.06250000) *
      13) sizeUniformity>=3.5 12   0 malignant (0.00000000 1.00000000) *
     7) bareNuclei>=2.5 157   8 malignant (0.05095541 0.94904459) *

显示预测误差:

  • CP:复杂度参数
  • nsplit:分支数
  • rel error:训练集误差
  • xerror:交叉验证误差
  • xstd:交叉验证误差的标准差
dtree$cptable
          CP nsplit  rel error    xerror       xstd
1 0.81764706      0 1.00000000 1.0000000 0.06194645
2 0.04117647      1 0.18235294 0.1823529 0.03169642
3 0.01764706      3 0.10000000 0.1588235 0.02970979
4 0.01000000      4 0.08235294 0.1235294 0.02637116

绘制交叉误差与复杂度参数的关系图

plotcp(dtree)

使用 prune() 函数进行剪枝

dtree_pruned <- prune(dtree, cp=0.01000000)

使用 rpart.plot 包的 prp() 函数绘制决策树

prp(
  dtree_pruned,
  type=2,
  extra=104,
  fallen.leaves=TRUE,
  main="Decision Tree"
)

预测验证集,并计算列联表

dtree_predict <- predict(
  dtree_pruned,
  df_validate,
  type="class"
)
dtree_pref <- table(
  df_validate$class,
  dtree_predict,
  dnn=c("Actual", "Predicted")
)
dtree_pref
           Predicted
Actual      benign malignant
  benign       129        10
  malignant      2        69

条件推断树

条件推断树基于显著性检验选择变量和分割,而不是纯净度或同质性一类的度量

显著性检验是置换检验

library(party)
fit_ctree <- ctree(
  class~.,
  data=df_train
)

绘图

plot(
  fit_ctree,
  main="Coditional Inference Tree"
)

预测

ctree_predict <- predict(
  fit_ctree,
  df_validate,
  type="response"
)

列联表

ctree_perf <- table(
  df_validate$class,
  ctree_predict,
  dnn=c("Actual", "Predicted")
)
ctree_perf
           Predicted
Actual      benign malignant
  benign       131         8
  malignant      4        67

随机森林

library(randomForest)

使用 randomForest 包的 reandomForest() 函数生成随机森林

set.seed(1234)
fit_forest <- randomForest(
  class~.,
  data=df_train,
  na.action=na.roughfix,
  importance=TRUE
)
fit_forest
Call:
 randomForest(formula = class ~ ., data = df_train, importance = TRUE,      na.action = na.roughfix) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 2.66%
Confusion matrix:
          benign malignant class.error
benign       313         6  0.01880878
malignant      7       163  0.04117647

变量重要性

importance(fit_forest, type=2)
                         MeanDecreaseGini
clumpThickness                  11.382432
sizeUniformity                  63.037519
shapeUniformity                 49.027649
maginalAdhesion                  4.275345
singleEpithelialCellSize        14.504981
bareNuclei                      42.736364
blankChromatin                  22.484755
normalNucleoli                  11.375285
mitosis                          1.755382

预测

forest_predict <- predict(
  fit_forest,
  df_validate
)

列联表

forest_perf <- table(
  df_validate$class,
  forest_predict,
  dnn=c("Actual", "Predicted")
)
forest_perf
           Predicted
Actual      benign malignant
  benign       128         7
  malignant      2        68

party 包中的 cforest() 函数基于条件推断树生成随机森林

支持向量机

SVM

library(e1071)

使用 e1071 包的 svm() 函数实现

set.seed(1234)
fit_svm <- svm(
  class~.,
  data=df_train
)
fit_svm
Call:
svm(formula = class ~ ., data = df_train)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 

Number of Support Vectors:  74

预测

svm_predict <- predict(
  fit_svm,
  na.omit(df_validate)
)

列联表

svm_perf <- table(
  na.omit(df_validate)$class,
  svm_predict,
  dnn=c("Actual", "Predicted")
)
svm_perf
           Predicted
Actual      benign malignant
  benign       126         9
  malignant      0        70

选择调和参数

使用带 RBF 核的 SVM 拟合样本,有两个参数:

  • gamma
  • cost

tune.svm() 函数使用格点搜索法

set.seed(1234)
tuned <- tune.svm(
  class~.,
  data=df_train,
  gamma=10^(-6:1),
  cost=10^(-10:10)
)
tuned
Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 gamma cost
  0.01    1

- best performance: 0.02740302 

使用最优参数拟合

fit_svm <- svm(
  class~.,
  data=df_train,
  gamma=.01,
  cost=1
)

预测

svm_predict <- predict(
  fit_svm,
  na.omit(df_validate)
)

列联表

svm_perf <- table(
  na.omit(df_validate)$class,
  svm_predict,
  dnn=c("Actual", "Predicted")
)
svm_perf
           Predicted
Actual      benign malignant
  benign       128         7
  malignant      0        70

选择预测效果最好的解

统计量解释
敏感度/正例覆盖率/召回率正类的样本单元被成功预测的概率
特异性/负例覆盖率负类的样本单元被成功预测的概率
正例命中率/精确度被预测为正类的样本单元中,预测正确的样本单元占比
负例命中率被预测为负类的样本单元中,预测正确的样本单元占比
准确率/ACC被正确分类的样本单元所占比重

评估二分类准确性

performance <- function(table, n=2) {
  if (!all(dim(table) == c(2, 2)))
    stop("Must be a 2 x 2 table")

  tn <- table[1, 1]
  fp <- table[1, 2]
  fn <- table[2, 1]
  tp <- table[2, 2]
  
  sensitivity <- tp/(tp + fn)
  specificity <- tn/(tn + fp)
  ppp <- tp/(tp + fp)
  npp <- tn/(tn + fn)
  hitrate <- (tp + tn)/(tp + tn + fp + fn)
  result <- paste("Sensitivity = ", round(sensitivity, n),
                  "\nSpecificity = ", round(specificity, n),
                  "\nPositive Predictive Value = ", round(ppp, n),
                  "\nNegative Predictive Value = ", round(npp, n),
                  "\nAccuracy = ", round(hitrate, n), "\n", sep="")
  cat(result)
}

performance() 函数用于上述五个分类器

performance(logit_perfect)
Sensitivity = 0.99
Specificity = 0.96
Positive Predictive Value = 0.93
Negative Predictive Value = 0.99
Accuracy = 0.97
performance(dtree_pref)
Sensitivity = 0.97
Specificity = 0.93
Positive Predictive Value = 0.87
Negative Predictive Value = 0.98
Accuracy = 0.94
performance(ctree_perf)
Sensitivity = 0.94
Specificity = 0.94
Positive Predictive Value = 0.89
Negative Predictive Value = 0.97
Accuracy = 0.94
performance(forest_perf)
Sensitivity = 0.97
Specificity = 0.95
Positive Predictive Value = 0.91
Negative Predictive Value = 0.98
Accuracy = 0.96
performance(svm_perf)
Sensitivity = 1
Specificity = 0.95
Positive Predictive Value = 0.91
Negative Predictive Value = 1
Accuracy = 0.97

本章节均使用 0.5 作为阈值,变动阈值的影响可以通过 ROC 曲线来进一步观察

rattle 包进行数据挖掘

library(rattle)
rattle()

参考

https://github.com/perillaroc/r-in-action-study

R 语言实战

图形初阶

基本数据管理

高级数据管理

基本图形

基本统计分析

回归

方差分析

中级绘图

重抽样与自助法

广义线性模型

主成分分析和因子分析

时间序列

聚类