R语言实战:方差分析

目录

本文内容来自《R 语言实战》(R in Action, 2nd),有部分修改

术语速成

组件因子

因变量 vs 自变量

均衡设计 vs 非均衡设计

单因素方差分析 one-way ANOVA,单因素组间方差分析

组内因子

单因素组内方差分析,重复测量方差分析

主效应 vs 交互效应

因素方差分析设计

混合模型方差分析

混淆因素 confounding factor

干扰变数 nuisance variable

协变量

协方差分析 ANCOVA

多元方差分析 MANOVA

多元协方差分析 MANCOVA

ANOVA 模型拟合

广义线性模型的特例

aov() 函数

定量变量

组别因子

标识变量

表达式中各项的顺序

类型1:序贯型(默认)

类型2:分层型

类型3:边界型

car 包的 Anova() 函数

单因素方差分析

library(multcomp)
head(cholesterol)
    trt response
1 1time   3.8612
2 1time  10.3868
3 1time   5.9059
4 1time   3.0609
5 1time   7.7204
6 1time   2.7139
attach(cholesterol)
table(trt)
trt
 1time 2times 4times  drugD  drugE 
    10     10     10     10     10
aggregate(
  response,
  by=list(trt),
  FUN=mean
)
  Group.1        x
1   1time  5.78197
2  2times  9.22497
3  4times 12.37478
4   drugD 15.36117
5   drugE 20.94752
aggregate(
  response,
  by=list(trt),
  FUN=sd
)
  Group.1        x
1   1time 2.878113
2  2times 3.483054
3  4times 2.923119
4   drugD 3.454636
5   drugE 3.345003
fit <- aov(
  response ~ trt
)
summary(fit)
            Df Sum Sq Mean Sq F value   Pr(>F)    
trt          4 1351.4   337.8   32.43 9.82e-13 ***
Residuals   45  468.8    10.4                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
library(gplots)

绘制各组均值及置信区间

plotmeans(
  response ~ trt,
  xlab="Treatment",
  ylab="Response",
  main="Mean Plot\nwith 95% CI"
)

detach(cholesterol)

多重比较

TukeyHSD() 函数,成对检验

TukeyHSD(fit)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = response ~ trt)

$trt
                  diff        lwr       upr     p adj
2times-1time   3.44300 -0.6582817  7.544282 0.1380949
4times-1time   6.59281  2.4915283 10.694092 0.0003542
drugD-1time    9.57920  5.4779183 13.680482 0.0000003
drugE-1time   15.16555 11.0642683 19.266832 0.0000000
4times-2times  3.14981 -0.9514717  7.251092 0.2050382
drugD-2times   6.13620  2.0349183 10.237482 0.0009611
drugE-2times  11.72255  7.6212683 15.823832 0.0000000
drugD-4times   2.98639 -1.1148917  7.087672 0.2512446
drugE-4times   8.57274  4.4714583 12.674022 0.0000037
drugE-drugD    5.58635  1.4850683  9.687632 0.0030633
par(las=2)
par(mar=c(5, 8, 4, 2))
plot(TukeyHSD(fit))

multcomp 包中的 glht() 函数

tuk <- glht(
  fit,
  linfct=mcp(trt="Tukey")
)
summary(tuk)
	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = response ~ trt)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)    
2times - 1time == 0     3.443      1.443   2.385  0.13810    
4times - 1time == 0     6.593      1.443   4.568  < 0.001 ***
drugD - 1time == 0      9.579      1.443   6.637  < 0.001 ***
drugE - 1time == 0     15.166      1.443  10.507  < 0.001 ***
4times - 2times == 0    3.150      1.443   2.182  0.20508    
drugD - 2times == 0     6.136      1.443   4.251  < 0.001 ***
drugE - 2times == 0    11.723      1.443   8.122  < 0.001 ***
drugD - 4times == 0     2.986      1.443   2.069  0.25126    
drugE - 4times == 0     8.573      1.443   5.939  < 0.001 ***
drugE - drugD == 0      5.586      1.443   3.870  0.00304 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
par(mar=c(5, 4, 6, 2))
plot(
  cld(tuk, level=.05),
  col="lightgray"
)

评估检验的假设条件

假设因变量服从正态分布,各组方差相等

Q-Q 图检验正态性假设

library(car)
qqPlot(
  lm(response ~ trt, data=cholesterol),
  simulate=TRUE,
  main="Q-Q Plot",
  labels=FALSE
)
[1] 19 38

方差齐性检验

Bartlett 检验

bartlett.test(
  response ~ trt,
  data=cholesterol
)
	Bartlett test of homogeneity of variances

data:  response by trt
Bartlett's K-squared = 0.57975, df = 4, p-value = 0.9653

Fligner-Killeen 检验

fligner.test(
  response ~ trt,
  data=cholesterol
)
	Fligner-Killeen test of homogeneity of variances

data:  response by trt
Fligner-Killeen:med chi-squared = 0.74277, df = 4, p-value = 0.946

Brown-Forsythe 检验

library(HH)
hov(
  response ~ trt,
  data=cholesterol
)
	hov: Brown-Forsyth

data:  response
F = 0.075477, df:trt = 4, df:Residuals = 45, p-value = 0.9893
alternative hypothesis: variances are not identical

car 包的 outlierTest() 函数检测离群点

outlierTest(fit)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
   rstudent unadjusted p-value Bonferroni p
19 2.251149           0.029422           NA

单因素协方差分析

ANCOVA

multcomp 包的 litter 数据集

head(litter)
  dose weight gesttime number
1    0  28.05     22.5     15
2    0  33.33     22.5     14
3    0  36.37     22.0     14
4    0  35.52     22.0     13
5    0  36.77     21.5     15
6    0  29.60     23.0      5
attach(litter)
table(dose)
dose
  0   5  50 500 
 20  19  18  17
aggregate(
  weight,
  by=list(dose),
  FUN=mean
)
  Group.1        x
1       0 32.30850
2       5 29.30842
3      50 29.86611
4     500 29.64647
fit <- aov(
  weight ~ gesttime + dose
)
summary(fit)
            Df Sum Sq Mean Sq F value  Pr(>F)   
gesttime     1  134.3  134.30   8.049 0.00597 **
dose         3  137.1   45.71   2.739 0.04988 * 
Residuals   69 1151.3   16.69                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

effects 包中的 effect() 函数计算调整均值

library(effects)
effect("dose", fit)
 dose effect
dose
       0        5       50      500 
32.35367 28.87672 30.56614 29.33460 

multcomp 包的 glht() 函数

constrast <- rbind(
  "no drug vs. drug" = c(3, -1, -1, -1)
)
summary(
  glht(
    fit,
    linfct=mcp(dose=constrast)
  )
)
	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts


Fit: aov(formula = weight ~ gesttime + dose)

Linear Hypotheses:
                      Estimate Std. Error t value Pr(>|t|)  
no drug vs. drug == 0    8.284      3.209   2.581    0.012 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

评估检验的假设条件

ANCOVA 需要正态性和同方差性假设,同时还假定回归效率相同

fit2 <- aov(
  weight ~ gesttime*dose,
  data=litter
)
summary(fit2)
              Df Sum Sq Mean Sq F value  Pr(>F)   
gesttime       1  134.3  134.30   8.289 0.00537 **
dose           3  137.1   45.71   2.821 0.04556 * 
gesttime:dose  3   81.9   27.29   1.684 0.17889   
Residuals     66 1069.4   16.20                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

交互项不显著,支持斜率相等的假设

结果可视化

HH 包中的 ancova() 函数

library(HH)
ancova(
  weight ~ gesttime + dose,
  data=litter
)
Analysis of Variance Table

Response: weight
          Df  Sum Sq Mean Sq F value   Pr(>F)   
gesttime   1  134.30 134.304  8.0493 0.005971 **
dose       3  137.12  45.708  2.7394 0.049883 * 
Residuals 69 1151.27  16.685                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

detach(litter)

双因素方差分析

head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5
attach(ToothGrowth)
table(supp, dose)
    dose
supp 0.5  1  2
  OJ  10 10 10
  VC  10 10 10
aggregate(
  len,
  by=list(supp, dose),
  FUN=mean
)
  Group.1 Group.2     x
1      OJ     0.5 13.23
2      VC     0.5  7.98
3      OJ     1.0 22.70
4      VC     1.0 16.77
5      OJ     2.0 26.06
6      VC     2.0 26.14
aggregate(
  len,
  by=list(supp, dose),
  FUN=sd
)
  Group.1 Group.2        x
1      OJ     0.5 4.459709
2      VC     0.5 2.746634
3      OJ     1.0 3.910953
4      VC     1.0 2.515309
5      OJ     2.0 2.655058
6      VC     2.0 4.797731
dose <- factor(dose)
fit <- aov(len ~ supp*dose)
summary(fit)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4  15.572 0.000231 ***
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp:dose    2  108.3    54.2   4.107 0.021860 *  
Residuals   54  712.1    13.2                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
interaction.plot(
  dose, supp, len,
  type="b",
  col=c("red", "blue"),
  pch=c(16, 18),
  main="Interaction between Dose and Supplement Type"
)

gplots 包中的 plotmeans() 函数展示交互作用

plotmeans(
  len ~ interaction(supp, dose, sep=" "),
  connect=list(c(1, 3, 5), c(2, 4, 6)),
  col=c("red", "darkgreen"),
  main="Interation Plot with 95% CIs",
  xlab="Treatment and Dose Combination"
)

HH 包的 interaction2wt() 函数,对任意顺序的因子设计主效应和交互效应

interaction2wt(len ~ supp*dose, data=ToothGrowth)

重复测量方差分析

head(CO2)
Grouped Data: uptake ~ conc | Plant
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
CO2$conc <- factor(CO2$conc)
w1b1 <- subset(CO2, Treatment=="chilled")
head(w1b1)
Grouped Data: uptake ~ conc | Plant
   Plant   Type Treatment conc uptake
22   Qc1 Quebec   chilled   95   14.2
23   Qc1 Quebec   chilled  175   24.1
24   Qc1 Quebec   chilled  250   30.3
25   Qc1 Quebec   chilled  350   34.6
26   Qc1 Quebec   chilled  500   32.5
27   Qc1 Quebec   chilled  675   35.4
table(w1b1$Plant)
Qc1 Qc3 Qc2 Mc2 Mc3 Mc1 
  7   7   7   7   7   7 

含一个组间因子(Type)和一个组内因子(conc)的重复测量方差分析

自变量:conc 和 Type

因变量:uptake

标识变量:Plant,表示重复测量

fit <- aov(
  uptake ~ conc * Type + Error(Plant/(conc)),
  w1b1
)
summary(fit)
Error: Plant
          Df Sum Sq Mean Sq F value  Pr(>F)   
Type       1 2667.2  2667.2   60.41 0.00148 **
Residuals  4  176.6    44.1                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: Plant:conc
          Df Sum Sq Mean Sq F value   Pr(>F)    
conc       6 1472.4  245.40   52.52 1.26e-12 ***
conc:Type  6  428.8   71.47   15.30 3.75e-07 ***
Residuals 24  112.1    4.67                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
par(las=2)
par(mar=c(10, 4, 4, 2))
with(
  w1b1,
  interaction.plot(
    conc, Type, uptake,
    type="b",
    col=c("red", "blue"),
    pch=c(16, 18),
    main="Interaction Plot for Plant Type and Concentration"
  )
)

par(las=2)
par(mar=c(10, 4, 4, 2))
boxplot(
  uptake ~ Type*conc,
  data=w1b1,
  col=c("gold", "green"),
  main="Chilled Qubec and Mississippi Plants",
  ylab="Carbon dioxide uptake rate (umol/m^2 sec)",
  xlab=""
)

宽格式 wide format

列是变量,行是观测,一行一个受试对象

head(litter)
  dose weight gesttime number
1    0  28.05     22.5     15
2    0  33.33     22.5     14
3    0  36.37     22.0     14
4    0  35.52     22.0     13
5    0  36.77     21.5     15
6    0  29.60     23.0      5

长格式 long format

因变量的每次观测放到独立的行中

head(CO2)
Grouped Data: uptake ~ conc | Plant
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2

多元方差分析

MANOVA

MASS 库的 UScereal 数据集

library(MASS)
head(UScereal)
                          mfr calories   protein      fat   sodium     fibre    carbo
100% Bran                   N 212.1212 12.121212 3.030303 393.9394 30.303030 15.15152
All-Bran                    K 212.1212 12.121212 3.030303 787.8788 27.272727 21.21212
All-Bran with Extra Fiber   K 100.0000  8.000000 0.000000 280.0000 28.000000 16.00000
Apple Cinnamon Cheerios     G 146.6667  2.666667 2.666667 240.0000  2.000000 14.00000
Apple Jacks                 K 110.0000  2.000000 0.000000 125.0000  1.000000 11.00000
Basic 4                     G 173.3333  4.000000 2.666667 280.0000  2.666667 24.00000
                            sugars shelf potassium vitamins
100% Bran                 18.18182     3 848.48485 enriched
All-Bran                  15.15151     3 969.69697 enriched
All-Bran with Extra Fiber  0.00000     3 660.00000 enriched
Apple Cinnamon Cheerios   13.33333     1  93.33333 enriched
Apple Jacks               14.00000     2  30.00000 enriched
Basic 4                   10.66667     3 133.33333 enriched
attach(UScereal)
shelf <- factor(shelf)
y <- cbind(
  calories, fat, sugars
)
head(y)
     calories      fat   sugars
[1,] 212.1212 3.030303 18.18182
[2,] 212.1212 3.030303 15.15151
[3,] 100.0000 0.000000  0.00000
[4,] 146.6667 2.666667 13.33333
[5,] 110.0000 0.000000 14.00000
[6,] 173.3333 2.666667 10.66667
aggregate(
  y,
  by=list(shelf),
  FUN=mean
)
  Group.1 calories       fat    sugars
1       1 119.4774 0.6621338  6.295493
2       2 129.8162 1.3413488 12.507670
3       3 180.1466 1.9449071 10.856821
cov(y)
           calories       fat     sugars
calories 3895.24210 60.674383 180.380317
fat        60.67438  2.713399   3.995474
sugars    180.38032  3.995474  34.050018

manova() 对组间差异进行多元检验

fit <- manova(y ~ shelf)
summary(fit)
          Df Pillai approx F num Df den Df    Pr(>F)    
shelf      2 0.4021   5.1167      6    122 0.0001015 ***
Residuals 62                                            
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

summary.aov() 函数对每一个变量做单因素方差分析

summary.aov(fit)
 Response calories :
            Df Sum Sq Mean Sq F value    Pr(>F)    
shelf        2  50435 25217.6  7.8623 0.0009054 ***
Residuals   62 198860  3207.4                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Response fat :
            Df Sum Sq Mean Sq F value  Pr(>F)  
shelf        2  18.44  9.2199  3.6828 0.03081 *
Residuals   62 155.22  2.5035                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Response sugars :
            Df  Sum Sq Mean Sq F value   Pr(>F)   
shelf        2  381.33 190.667  6.5752 0.002572 **
Residuals   62 1797.87  28.998                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

评估假设检验

多元正态性

方差-协方差矩阵同质性

多元正态性:Q-Q 图

center <- colMeans(y)
cov <- cov(y)
d <- mahalanobis(y, center, cov)
n <- nrow(y)
p <- ncol(y)
coord <- qqplot(
  qchisq(
    ppoints(n),
    df=p
  ),
  d,
  main="Q-Q Plot Assessing Multivariate Normality",
  ylab="Mahalanobis D2"
)
abline(a=0, b=1)

方差-协方差同质性:Box’s M 检验

多元离群点:mvoutlier 包中的 ap.plot() 函数

library(mvoutlier)
outliers <- aq.plot(y)
outliers

稳健多元方差分析

稳健单因素 MANOVA:rrcov 包中的 Wilks.test() 函数

非参数 MANOVA:vegan 包中的 adonis() 函数

library(rrcov)
Wilks.test(
  y,
  shelf,
  method="mcd"
)
	Robust One-way MANOVA (Bartlett Chi2)

data:  x
Wilks' Lambda = 0.51073, Chi2-Value = 21.979, DF = 4.575, p-value = 0.0003573
sample estimates:
  calories       fat    sugars
1 119.8210 0.7010828  5.663143
2 128.0407 1.1849576 12.537533
3 160.8604 1.6524559 10.352646

用回归来做 ANOVA

levels(cholesterol$trt)
[1] "1time"  "2times" "4times" "drugD"  "drugE" 
fit_aov <- aov(
  response ~ trt,
  data=cholesterol
)
summary(fit_aov)
            Df Sum Sq Mean Sq F value   Pr(>F)    
trt          4 1351.4   337.8   32.43 9.82e-13 ***
Residuals   45  468.8    10.4                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
fit_lm <- lm(
  response ~ trt,
  data=cholesterol
)
summary(fit_lm)
Call:
lm(formula = response ~ trt, data = cholesterol)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5418 -1.9672 -0.0016  1.8901  6.6008 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.782      1.021   5.665 9.78e-07 ***
trt2times      3.443      1.443   2.385   0.0213 *  
trt4times      6.593      1.443   4.568 3.82e-05 ***
trtdrugD       9.579      1.443   6.637 3.53e-08 ***
trtdrugE      15.166      1.443  10.507 1.08e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.227 on 45 degrees of freedom
Multiple R-squared:  0.7425,	Adjusted R-squared:  0.7196 
F-statistic: 32.43 on 4 and 45 DF,  p-value: 9.819e-13
contrasts(cholesterol$trt)
       2times 4times drugD drugE
1time       0      0     0     0
2times      1      0     0     0
4times      0      1     0     0
drugD       0      0     1     0
drugE       0      0     0     1
fit_lm <- lm(
  response ~ trt,
  data=cholesterol,
  contrasts=c("contr.SAS", "contr.helmert")
)
summary(fit_lm)
non-list contrasts argument ignored
Call:
lm(formula = response ~ trt, data = cholesterol, contrasts = c("contr.SAS", 
    "contr.helmert"))

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5418 -1.9672 -0.0016  1.8901  6.6008 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.782      1.021   5.665 9.78e-07 ***
trt2times      3.443      1.443   2.385   0.0213 *  
trt4times      6.593      1.443   4.568 3.82e-05 ***
trtdrugD       9.579      1.443   6.637 3.53e-08 ***
trtdrugE      15.166      1.443  10.507 1.08e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.227 on 45 degrees of freedom
Multiple R-squared:  0.7425,	Adjusted R-squared:  0.7196 
F-statistic: 32.43 on 4 and 45 DF,  p-value: 9.819e-13

参考

https://github.com/perillaroc/r-in-action-study

R 语言实战

图形初阶

基本数据管理

高级数据管理

基本图形

基本统计分析

回归