R语言实战:基本统计分析

目录

本文内容来自《R 语言实战》(R in Action, 2nd),有部分修改

描述性统计分析

mtcars 数据集中的三个连续变量

mpg:每加仑汽油行驶英里数

hp:马力

wt:车重

selected_variables <- c("mpg", "hp", "wt")
head(mtcars[selected_variables])
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460

两个分类变量

am:变速箱类型

cyl:汽缸数

方法云集

summary(mtcars[selected_variables])
      mpg              hp              wt       
 Min.   :10.40   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :335.0   Max.   :5.424  

sapply() 函数格式

sapply(x, FUN, options)

fivenum() 返回图基五数总括 (Tukey’s five-number summary)

sapply(
  mtcars[selected_variables],
  fivenum
)
       mpg  hp     wt
[1,] 10.40  52 1.5130
[2,] 15.35  96 2.5425
[3,] 19.20 123 3.3250
[4,] 22.80 180 3.6500
[5,] 33.90 335 5.4240

自定义统计函数

在统计学中,峰度(Kurtosis)衡量实数随机变量概率分布的峰态。 峰度高就意味着方差增大是由低频度的大于或小于平均值的极端差值引起的。

– wiki

在概率论和统计学中,偏度衡量实数随机变量概率分布的不对称性。

– wiki

my_stats <- function(x, na.omit=TRUE) {
  if(na.omit) {
    x <- x[!is.na(x)]
  }
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x-m)^3/s^3)/n # 偏度 Skewness
  kurt <- sum((x-m)^4/s^4)/n - 3 # 峰度 Kurtosis
  return (
    c(
      n=n,
      mean=m,
      stdev=s,
      skew=skew,
      kurtosis=kurt
    )
  )
}
sapply(mtcars[selected_variables], my_stats)
               mpg          hp          wt
n        32.000000  32.0000000 32.00000000
mean     20.090625 146.6875000  3.21725000
stdev     6.026948  68.5628685  0.97845744
skew      0.610655   0.7260237  0.42314646
kurtosis -0.372766  -0.1355511 -0.02271075

更多方法

Hmisc 包中的 describe() 函数

library(Hmisc)
describe(mtcars[selected_variables])
mtcars[selected_variables] 

 3  Variables      32  Observations
---------------------------------------------------------------------------------------------------------
mpg 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75 
      32        0       25    0.999    20.09    6.796    12.00    14.34    15.43    19.20    22.80 
     .90      .95 
   30.09    31.30 

lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
---------------------------------------------------------------------------------------------------------
hp 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75 
      32        0       22    0.997    146.7    77.04    63.65    66.00    96.50   123.00   180.00 
     .90      .95 
  243.50   253.55 

lowest :  52  62  65  66  91, highest: 215 230 245 264 335
---------------------------------------------------------------------------------------------------------
wt 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75 
      32        0       29    0.999    3.217    1.089    1.736    1.956    2.581    3.325    3.610 
     .90      .95 
   4.048    5.293 

lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424
---------------------------------------------------------------------------------------------------------

pastecs 包中的 stat.desc() 函数

library(pastecs)
stat.desc(mtcars[selected_variables])
                     mpg           hp          wt
nbr.val       32.0000000   32.0000000  32.0000000
nbr.null       0.0000000    0.0000000   0.0000000
nbr.na         0.0000000    0.0000000   0.0000000
min           10.4000000   52.0000000   1.5130000
max           33.9000000  335.0000000   5.4240000
range         23.5000000  283.0000000   3.9110000
sum          642.9000000 4694.0000000 102.9520000
median        19.2000000  123.0000000   3.3250000
mean          20.0906250  146.6875000   3.2172500
SE.mean        1.0654240   12.1203173   0.1729685
CI.mean.0.95   2.1729465   24.7195501   0.3527715
var           36.3241028 4700.8669355   0.9573790
std.dev        6.0269481   68.5628685   0.9784574
coef.var       0.2999881    0.4674077   0.3041285

psych 包中的 describe() 函数

library(psych)

注:可以使用 包名::函数名 的方式访问被覆盖的函数

psych::describe(mtcars[selected_variables])
    vars  n   mean    sd median trimmed   mad   min    max  range skew kurtosis    se
mpg    1 32  20.09  6.03  19.20   19.70  5.41 10.40  33.90  23.50 0.61    -0.37  1.07
hp     2 32 146.69 68.56 123.00  141.19 77.10 52.00 335.00 283.00 0.73    -0.14 12.12
wt     3 32   3.22  0.98   3.33    3.15  0.77  1.51   5.42   3.91 0.42    -0.02  0.17

分组计算描述性统计量

aggregate() 函数 by 参数

aggregate(
  mtcars[selected_variables],
  by=list(am=mtcars$am),
  mean
)
  am      mpg       hp       wt
1  0 17.14737 160.2632 3.768895
2  1 24.39231 126.8462 2.411000
aggregate(
  mtcars[selected_variables],
  by=list(am=mtcars$am),
  sd
)
  am      mpg       hp        wt
1  0 3.833966 53.90820 0.7774001
2  1 6.166504 84.06232 0.6169816

by() 函数

dstats <- function(x) {
  return (sapply(x, my_stats))
}

by(
  mtcars[selected_variables],
  mtcars$am,
  dstats
)
mtcars$am: 0
                 mpg           hp         wt
n        19.00000000  19.00000000 19.0000000
mean     17.14736842 160.26315789  3.7688947
stdev     3.83396639  53.90819573  0.7774001
skew      0.01395038  -0.01422519  0.9759294
kurtosis -0.80317826  -1.20969733  0.1415676
------------------------------------------------------------------------------ 
mtcars$am: 1
                 mpg          hp         wt
n        13.00000000  13.0000000 13.0000000
mean     24.39230769 126.8461538  2.4110000
stdev     6.16650381  84.0623243  0.6169816
skew      0.05256118   1.3598859  0.2103128
kurtosis -1.45535200   0.5634635 -1.1737358

分组计算的扩展

doBy 包中的 summaryBy() 函数

var1 + var2 + var3 + … + varN ~ groupvar1 + groupvar2 + … + groupvarN

library(doBy)
summaryBy(
  mpg + hp + wt ~ am,
  data=mtcars,
  FUN=my_stats
)
  am mpg.n mpg.mean mpg.stdev   mpg.skew mpg.kurtosis hp.n  hp.mean hp.stdev     hp.skew hp.kurtosis
1  0    19 17.14737  3.833966 0.01395038   -0.8031783   19 160.2632 53.90820 -0.01422519  -1.2096973
2  1    13 24.39231  6.166504 0.05256118   -1.4553520   13 126.8462 84.06232  1.35988586   0.5634635
  wt.n  wt.mean  wt.stdev   wt.skew wt.kurtosis
1   19 3.768895 0.7774001 0.9759294   0.1415676
2   13 2.411000 0.6169816 0.2103128  -1.1737358

psych 包中的 describeBy() 函数

describeBy(
  mtcars[selected_variables],
  list(am=mtcars$am)
)
 Descriptive statistics by group 
am: 0
    vars  n   mean    sd median trimmed   mad   min    max  range  skew kurtosis    se
mpg    1 19  17.15  3.83  17.30   17.12  3.11 10.40  24.40  14.00  0.01    -0.80  0.88
hp     2 19 160.26 53.91 175.00  161.06 77.10 62.00 245.00 183.00 -0.01    -1.21 12.37
wt     3 19   3.77  0.78   3.52    3.75  0.45  2.46   5.42   2.96  0.98     0.14  0.18
------------------------------------------------------------------------------ 
am: 1
    vars  n   mean    sd median trimmed   mad   min    max  range skew kurtosis    se
mpg    1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90  18.90 0.05    -1.46  1.71
hp     2 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00 283.00 1.36     0.56 23.31
wt     3 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57   2.06 0.21    -1.17  0.17

结果的可视化

直方图

密度图

箱线图

点图

频数表和列联表

使用 vcd 包中的 Arthritis 数据集

library(vcd)
head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked

生成频数表

一维列联表

table() 生成频数统计表

one_d_table <- with(
  Arthritis,
  table(Improved)
)
one_d_table
Improved
  None   Some Marked 
    42     14     28 

prop.table() 将频数统计表转为比例值

prop.table(my_table)
Improved
     None      Some    Marked 
0.5000000 0.1666667 0.3333333 
prop.table(my_table) * 100
Improved
    None     Some   Marked 
50.00000 16.66667 33.33333 

二维列联表

table(
  Arthritis$Treatment, # 行
  Arthritis$Improved   # 列
)
          None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

xtabs() 函数

my_table <- xtabs(
  ~ Treatment + Improved,
  data=Arthritis
)
my_table
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

margin.table() 生成边际频数

1 表示每行生成一个边际值,即为第 1 个维度生成边际值

margin.table(my_table, 1)
Treatment
Placebo Treated 
     43      41 
margin.table(my_table, 2)
Improved
  None   Some Marked 
    42     14     28

prop.table() 生成比例

1 表示沿行生成比例,即沿第一个维度计算比例

prop.table(my_table, 1)
         Improved
Treatment      None      Some    Marked
  Placebo 0.6744186 0.1627907 0.1627907
  Treated 0.3170732 0.1707317 0.5121951
prop.table(my_table, 2)
         Improved
Treatment      None      Some    Marked
  Placebo 0.6744186 0.1627907 0.1627907
  Treated 0.3170732 0.1707317 0.5121951

addmargins() 添加边际和

addmargins(my_table)
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84

1 表示行,2 表示列。

下面代码中的 1 表示按行求比例,即每行所有数值加和为 1

2 表示添加列方向的累加和,也就是为每行添加一个累加值

addmargins(
  prop.table(
    my_table, 
    1
  ),
  2
)
         Improved
Treatment      None      Some    Marked       Sum
  Placebo 0.6744186 0.1627907 0.1627907 1.0000000
  Treated 0.3170732 0.1707317 0.5121951 1.0000000
addmargins(
  prop.table(
    my_table, 
    2
  ), 
  1
)
         Improved
Treatment      None      Some    Marked
  Placebo 0.6904762 0.5000000 0.2500000
  Treated 0.3095238 0.5000000 0.7500000
  Sum     1.0000000 1.0000000 1.0000000

gmodels 包中的 CrossTable() 函数

library(gmodels)
CrossTable(
  Arthritis$Treatment, 
  Arthritis$Improved
)
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  84 

 
                    | Arthritis$Improved 
Arthritis$Treatment |      None |      Some |    Marked | Row Total | 
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 | 
                    |     2.616 |     0.004 |     3.752 |           | 
                    |     0.674 |     0.163 |     0.163 |     0.512 | 
                    |     0.690 |     0.500 |     0.250 |           | 
                    |     0.345 |     0.083 |     0.083 |           | 
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 | 
                    |     2.744 |     0.004 |     3.935 |           | 
                    |     0.317 |     0.171 |     0.512 |     0.488 | 
                    |     0.310 |     0.500 |     0.750 |           | 
                    |     0.155 |     0.083 |     0.250 |           | 
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 | 
                    |     0.500 |     0.167 |     0.333 |           | 
--------------------|-----------|-----------|-----------|-----------|

多维列联表

my_table <- xtabs(
  ~ Treatment + Sex + Improved,
  data=Arthritis
)
my_table
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

ftable() 以一种更紧凑的形式输出多维列联表

ftable(my_table)
                 Improved None Some Marked
Treatment Sex                             
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5

边际频数

margin.table(
  my_table,
  1
)
Treatment
Placebo Treated 
     43      41 
margin.table(
  my_table,
  2
)
Sex
Female   Male 
    59     25 
margin.table(
  my_table,
  3
)
Improved
  None   Some Marked 
    42     14     28 

多维边际频数

margin.table(
  my_table, 
  c(1, 3)
)
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

比例

ftable(
  prop.table(
    my_table, 
    c(1, 2)
  )
)
                 Improved       None       Some     Marked
Treatment Sex                                             
Placebo   Female          0.59375000 0.21875000 0.18750000
          Male            0.90909091 0.00000000 0.09090909
Treated   Female          0.22222222 0.18518519 0.59259259
          Male            0.50000000 0.14285714 0.35714286
ftable(
  addmargins(
    prop.table(
      my_table, 
      c(1, 2)
    ),
    3  # 为第三维 (Improved) 增加列加和
  )
) * 100
                 Improved       None       Some     Marked        Sum
Treatment Sex                                                        
Placebo   Female           59.375000  21.875000  18.750000 100.000000
          Male             90.909091   0.000000   9.090909 100.000000
Treated   Female           22.222222  18.518519  59.259259 100.000000
          Male             50.000000  14.285714  35.714286 100.000000

独立性检验

卡方独立检验

卡方检验适用于计数数据,可以检验数据与预期分布的拟合程度。 在统计实践中,卡方统计量的最常见用法是与 r x c 列联表一起使用,以评估对变量间独立性的零假设是否合理。

引自 [1]

chisq.test()

下面示例显示治疗情况和改善情况不独立 (p值太小)

my_table <- xtabs(
  ~ Treatment + Improved,
  data=Arthritis
)
my_table
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
chisq.test(my_table)
	Pearson's Chi-squared test

data:  my_table
X-squared = 13.055, df = 2, p-value = 0.001463

p 值表示从总体中抽取的样本行变量与列变量是相互独立的概率。

下面示例显示性别和改善情况独立

my_table <- xtabs(
  ~ Improved + Sex,
  data=Arthritis
)
my_table
        Sex
Improved Female Male
  None       25   17
  Some       12    2
  Marked     22    6
chisq.test(my_table)
Chi-squared approximation may be incorrect
	Pearson's Chi-squared test

data:  my_table
X-squared = 4.8407, df = 2, p-value = 0.08889

Fisher 精确检验

可以实际列出所有可能出现的重排 (置换) 情况及其频数,进而确定观测结果的极端程度。 这一操作被称为费舍尔精确检验 (Fisher’s exact test)。

引自 [1]

my_table <- xtabs(
  ~ Treatment + Improved,
  data=Arthritis
)
my_table
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
fisher.test(my_table)
	Fisher's Exact Test for Count Data

data:  my_table
p-value = 0.001393
alternative hypothesis: two.sided

Cochran-Mantel-Haenszel 卡方检验

假设两个名义变量在第三个变量的每一层中都是条件独立的。

下面代码假设不存在三阶交互作用 (治疗情况 x 改善情况 x 性别)

my_table <- xtabs(
  ~ Treatment + Improved + Sex,
  data=Arthritis
)
ftable(my_table)
                   Sex Female Male
Treatment Improved                
Placebo   None             19   10
          Some              7    0
          Marked            6    1
Treated   None              6    7
          Some              5    2
          Marked           16    5

结果表明,治疗与得到的改善在性别的每一水平下并不独立

mantelhaen.test(my_table)
	Cochran-Mantel-Haenszel test

data:  my_table
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647

相关性的度量

vcd 包的 assocstats() 函数

计算二维列联表的 phi 系数,列联系数和 Cramer’s V 系数

my_table <- xtabs(
  ~ Treatment + Improved,
  data=Arthritis
)
my_table
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
assocstats(my_table)
                    X^2 df  P(> X^2)
Likelihood Ratio 13.530  2 0.0011536
Pearson          13.055  2 0.0014626

Phi-Coefficient   : NA 
Contingency Coeff.: 0.367 
Cramer's V        : 0.394 

结果的可视化

条形图

马赛克图

关联图

相关

相关系数

head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

相关的类型

相关系数

Pearson 积差相关系数衡量两个定量变量之间的线性相关程度。

Spearman 等级相关系数衡量分级定序变量之间的相关程度。

Kendall’s Tau 相关系数是一种非参数的等级相关度量。

cov() 计算协方差

cor() 计算相关系数

useall.obseverythingcomplete.obspairwise.complete.obs

method: pearsonspearmankendall

默认参数 use="everything", method="pearson"

states <- state.x77[, 1:6]
cov(states)
              Population      Income   Illiteracy     Life Exp      Murder      HS Grad
Population 19931683.7588 571229.7796  292.8679592 -407.8424612 5663.523714 -3551.509551
Income       571229.7796 377573.3061 -163.7020408  280.6631837 -521.894286  3076.768980
Illiteracy      292.8680   -163.7020    0.3715306   -0.4815122    1.581776    -3.235469
Life Exp       -407.8425    280.6632   -0.4815122    1.8020204   -3.869480     6.312685
Murder         5663.5237   -521.8943    1.5817755   -3.8694804   13.627465   -14.549616
HS Grad       -3551.5096   3076.7690   -3.2354694    6.3126849  -14.549616    65.237894

Pearson 积差相关系数

cor(states)
            Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000

Spearman 等级相关系数

cor(states, method="spearman")
           Population     Income Illiteracy   Life Exp     Murder    HS Grad
Population  1.0000000  0.1246098  0.3130496 -0.1040171  0.3457401 -0.3833649
Income      0.1246098  1.0000000 -0.3145948  0.3241050 -0.2174623  0.5104809
Illiteracy  0.3130496 -0.3145948  1.0000000 -0.5553735  0.6723592 -0.6545396
Life Exp   -0.1040171  0.3241050 -0.5553735  1.0000000 -0.7802406  0.5239410
Murder      0.3457401 -0.2174623  0.6723592 -0.7802406  1.0000000 -0.4367330
HS Grad    -0.3833649  0.5104809 -0.6545396  0.5239410 -0.4367330  1.0000000

非方形的相关矩阵

x <- states[, c(
  "Population", 
  "Income",
  "Illiteracy",
  "HS Grad"
)]
y <- states[, c(
  "Life Exp",
  "Murder"
)]
cor(x, y)
              Life Exp     Murder
Population -0.06805195  0.3436428
Income      0.34025534 -0.2300776
Illiteracy -0.58847793  0.7029752
HS Grad     0.58221620 -0.4879710

偏相关

偏相关是指在控制一个或多个定量变量时,另外两个定量变量之间的相互关系。

ggm 包中的 pcor() 函数

library(ggm)
colnames(states)
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"     "HS Grad"   

u 中前两个数值表示计算相关向量的下标,其余数值为条件变量下标

pcor(u, S)
pcor(
  c(1, 5, 2, 3, 6), 
  cov(states)
)
[1] 0.3462724

其他类型的相关

polycor 包中的 hetcor() 计算一种混合的相关矩阵

library(polycor)
hetcor(states)
Two-Step Estimates

Correlations/Type of Correlation:
           Population  Income Illiteracy Life.Exp  Murder HS.Grad
Population          1 Pearson    Pearson  Pearson Pearson Pearson
Income         0.2082       1    Pearson  Pearson Pearson Pearson
Illiteracy     0.1076 -0.4371          1  Pearson Pearson Pearson
Life.Exp     -0.06805  0.3403    -0.5885        1 Pearson Pearson
Murder         0.3436 -0.2301      0.703  -0.7808       1 Pearson
HS.Grad      -0.09849  0.6199    -0.6572   0.5822  -0.488       1

Standard Errors:
           Population  Income Illiteracy Life.Exp Murder
Population                                              
Income         0.1357                                   
Illiteracy     0.1401  0.1152                           
Life.Exp       0.1411  0.1257    0.09337                
Murder         0.1253  0.1344    0.07247  0.05604       
HS.Grad        0.1404 0.08801    0.08129   0.0944 0.1086

n = 50 

P-values for Tests of Bivariate Normality:
           Population  Income Illiteracy Life.Exp  Murder
Population                                               
Income       0.003105                                    
Illiteracy   0.004307 0.02102                            
Life.Exp      0.01916 0.03275     0.1717                 
Murder         0.1015   0.174     0.2613  0.07242        
HS.Grad      0.007223   0.114   0.005472   0.1134 0.02036

相关性的显著性检验

常用原假设为变量间不相关(即总体的相关系数为0)

cor.test()

alternativetwo.sidelessgreater

methodpearsonkendallspearman

cor.test(states[,3], states[,5])
	Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5279280 0.8207295
sample estimates:
      cor 
0.7029752 

psych 包中 corr.test() 函数

corr.test(
  states, 
  use="complete"
)
Call:corr.test(x = states, use = "complete")
Correlation matrix 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.21       0.11    -0.07   0.34   -0.10
Income           0.21   1.00      -0.44     0.34  -0.23    0.62
Illiteracy       0.11  -0.44       1.00    -0.59   0.70   -0.66
Life Exp        -0.07   0.34      -0.59     1.00  -0.78    0.58
Murder           0.34  -0.23       0.70    -0.78   1.00   -0.49
HS Grad         -0.10   0.62      -0.66     0.58  -0.49    1.00
Sample Size 
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       0.00   0.59       1.00      1.0   0.10       1
Income           0.15   0.00       0.01      0.1   0.54       0
Illiteracy       0.46   0.00       0.00      0.0   0.00       0
Life Exp         0.64   0.02       0.00      0.0   0.00       0
Murder           0.01   0.11       0.00      0.0   0.00       0
HS Grad          0.50   0.00       0.00      0.0   0.00       0

 To see confidence intervals of the correlations, print with the short=FALSE option

其他显著性检验

ggm 包的 pcor.test() 函数

psych 包的 r.test() 函数

相关关系的可视化

散点图和散点图矩阵

相关图 (correlogram)

t 检验

本节关注的变量为连续型的组间比较,假设其呈正态分布

library(MASS)
head(UScrime)
    M So  Ed Po1 Po2  LF  M.F Pop  NW  U1 U2 GDP Ineq     Prob    Time    y
1 151  1  91  58  56 510  950  33 301 108 41 394  261 0.084602 26.2011  791
2 143  0 113 103  95 583 1012  13 102  96 36 557  194 0.029599 25.2999 1635
3 142  1  89  45  44 533  969  18 219  94 33 318  250 0.083401 24.3006  578
4 136  0 121 149 141 577  994 157  80 102 39 673  167 0.015801 29.9012 1969
5 141  0 121 109 101 591  985  18  30  91 20 578  174 0.041399 21.2998 1234
6 121  0 110 118 115 547  964  25  44  84 29 689  126 0.034201 20.9995  682
  • Prob:监禁的概率
  • U1:14-24 岁年龄段城市男性失业率
  • U2:35-39 岁年龄段城市男性失业率
  • So:分类变量,是否为南方州

独立样本的 t 检验

t.test(
  Prob ~ So,
  data=UScrime
)
	Welch Two Sample t-test

data:  Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1 
     0.03851265      0.06371269 

可以拒绝南方各州和非南方各州拥有相同监禁概率的假设 (p < 0.001)

非独立样本的 t 检验

非独立组设计 dependent groups design

sapply(
  UScrime[c("U1", "U2")],
  function(x) (c(mean=mean(x), sd=sd(x)))
)
           U1       U2
mean 95.46809 33.97872
sd   18.02878  8.44545
with(
  UScrime,
  t.test(U1, U2, paried=TRUE)
)
	Welch Two Sample t-test

data:  U1 and U2
t = 21.174, df = 65.261, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 55.69010 67.28862
sample estimates:
mean of x mean of y 
 95.46809  33.97872 

多于两组的情况

方差分析(NOAA):第 9 章

组间差异的非参数检验

两组的比较

两组数据独立,可以使用 Wilcoxon 秩和检验 (Mann-Whitney U 检验) 评估观测是否是从相同的概率分布中抽得的。

with(
  UScrime,
  by(Prob, So, median)
)
So: 0
[1] 0.038201
------------------------------------------------------------------------ 
So: 1
[1] 0.055552
wilcox.test(
  Prob ~ So,
  data=UScrime
)
	Wilcoxon rank sum exact test

data:  Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0

Wilcoxon 符号秩检验是非独立样本 t 检验的一种非参数替代方法,适用于两组成对数据和无法保证正态性假设的情境。

sapply(
  UScrime[c("U1", "U2")],
  median
)
U1 U2 
92 34 
with(
  UScrime,
  wilcox.test(
    U1, U2,
    paired=TRUE
  )
)
cannot compute exact p-value with ties
	Wilcoxon signed rank test with continuity correction

data:  U1 and U2
V = 1128, p-value = 2.464e-09
alternative hypothesis: true location shift is not equal to 0

多于两组的比较

单向设计 one-way design

各组独立:Kruskal-Wallis 检验

各组不独立:Friedman 检验

states <- data.frame(
  state.region,
  state.x77
)
head(states)
           state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
Alabama           South       3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska             West        365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona            West       2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas          South       2110   3378        1.9    70.66   10.1    39.9    65  51945
California         West      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado           West       2541   4884        0.7    72.06    6.8    63.9   166 103766
kruskal.test(
  Illiteracy ~ state.region,
  data=states
)
	Kruskal-Wallis rank sum test

data:  Illiteracy by state.region
Kruskal-Wallis chi-squared = 22.672, df = 3, p-value = 4.726e-05

组件差异的可视化

箱线图

核密度图

第 9 章和第 19 章介绍的图形

参考

参考文献

[1] 面向数据科学家的实用统计学

R 语言实战

图形初阶

基本数据管理

高级数据管理

基本图形