R语言实战:基本统计分析
本文内容来自《R 语言实战》(R in Action, 2nd),有部分修改
描述性统计分析
mtcars 数据集中的三个连续变量
mpg:每加仑汽油行驶英里数
hp:马力
wt:车重
selected_variables <- c("mpg", "hp", "wt")
head(mtcars[selected_variables])
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
两个分类变量
am:变速箱类型
cyl:汽缸数
方法云集
summary(mtcars[selected_variables])
mpg hp wt
Min. :10.40 Min. : 52.0 Min. :1.513
1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
Median :19.20 Median :123.0 Median :3.325
Mean :20.09 Mean :146.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
Max. :33.90 Max. :335.0 Max. :5.424
sapply()
函数格式
sapply(x, FUN, options)
fivenum()
返回图基五数总括 (Tukey’s five-number summary)
sapply(
mtcars[selected_variables],
fivenum
)
mpg hp wt
[1,] 10.40 52 1.5130
[2,] 15.35 96 2.5425
[3,] 19.20 123 3.3250
[4,] 22.80 180 3.6500
[5,] 33.90 335 5.4240
自定义统计函数
在统计学中,峰度(Kurtosis)衡量实数随机变量概率分布的峰态。 峰度高就意味着方差增大是由低频度的大于或小于平均值的极端差值引起的。
– wiki
在概率论和统计学中,偏度衡量实数随机变量概率分布的不对称性。
– wiki
my_stats <- function(x, na.omit=TRUE) {
if(na.omit) {
x <- x[!is.na(x)]
}
m <- mean(x)
n <- length(x)
s <- sd(x)
skew <- sum((x-m)^3/s^3)/n # 偏度 Skewness
kurt <- sum((x-m)^4/s^4)/n - 3 # 峰度 Kurtosis
return (
c(
n=n,
mean=m,
stdev=s,
skew=skew,
kurtosis=kurt
)
)
}
sapply(mtcars[selected_variables], my_stats)
mpg hp wt
n 32.000000 32.0000000 32.00000000
mean 20.090625 146.6875000 3.21725000
stdev 6.026948 68.5628685 0.97845744
skew 0.610655 0.7260237 0.42314646
kurtosis -0.372766 -0.1355511 -0.02271075
更多方法
Hmisc
包中的 describe()
函数
library(Hmisc)
describe(mtcars[selected_variables])
mtcars[selected_variables]
3 Variables 32 Observations
---------------------------------------------------------------------------------------------------------
mpg
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
32 0 25 0.999 20.09 6.796 12.00 14.34 15.43 19.20 22.80
.90 .95
30.09 31.30
lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
---------------------------------------------------------------------------------------------------------
hp
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
32 0 22 0.997 146.7 77.04 63.65 66.00 96.50 123.00 180.00
.90 .95
243.50 253.55
lowest : 52 62 65 66 91, highest: 215 230 245 264 335
---------------------------------------------------------------------------------------------------------
wt
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
32 0 29 0.999 3.217 1.089 1.736 1.956 2.581 3.325 3.610
.90 .95
4.048 5.293
lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424
---------------------------------------------------------------------------------------------------------
pastecs
包中的 stat.desc()
函数
library(pastecs)
stat.desc(mtcars[selected_variables])
mpg hp wt
nbr.val 32.0000000 32.0000000 32.0000000
nbr.null 0.0000000 0.0000000 0.0000000
nbr.na 0.0000000 0.0000000 0.0000000
min 10.4000000 52.0000000 1.5130000
max 33.9000000 335.0000000 5.4240000
range 23.5000000 283.0000000 3.9110000
sum 642.9000000 4694.0000000 102.9520000
median 19.2000000 123.0000000 3.3250000
mean 20.0906250 146.6875000 3.2172500
SE.mean 1.0654240 12.1203173 0.1729685
CI.mean.0.95 2.1729465 24.7195501 0.3527715
var 36.3241028 4700.8669355 0.9573790
std.dev 6.0269481 68.5628685 0.9784574
coef.var 0.2999881 0.4674077 0.3041285
psych
包中的 describe()
函数
library(psych)
注:可以使用 包名::函数名
的方式访问被覆盖的函数
psych::describe(mtcars[selected_variables])
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
hp 2 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
wt 3 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
分组计算描述性统计量
aggregate()
函数 by
参数
aggregate(
mtcars[selected_variables],
by=list(am=mtcars$am),
mean
)
am mpg hp wt
1 0 17.14737 160.2632 3.768895
2 1 24.39231 126.8462 2.411000
aggregate(
mtcars[selected_variables],
by=list(am=mtcars$am),
sd
)
am mpg hp wt
1 0 3.833966 53.90820 0.7774001
2 1 6.166504 84.06232 0.6169816
by()
函数
dstats <- function(x) {
return (sapply(x, my_stats))
}
by(
mtcars[selected_variables],
mtcars$am,
dstats
)
mtcars$am: 0
mpg hp wt
n 19.00000000 19.00000000 19.0000000
mean 17.14736842 160.26315789 3.7688947
stdev 3.83396639 53.90819573 0.7774001
skew 0.01395038 -0.01422519 0.9759294
kurtosis -0.80317826 -1.20969733 0.1415676
------------------------------------------------------------------------------
mtcars$am: 1
mpg hp wt
n 13.00000000 13.0000000 13.0000000
mean 24.39230769 126.8461538 2.4110000
stdev 6.16650381 84.0623243 0.6169816
skew 0.05256118 1.3598859 0.2103128
kurtosis -1.45535200 0.5634635 -1.1737358
分组计算的扩展
doBy
包中的 summaryBy()
函数
var1 + var2 + var3 + … + varN ~ groupvar1 + groupvar2 + … + groupvarN
library(doBy)
summaryBy(
mpg + hp + wt ~ am,
data=mtcars,
FUN=my_stats
)
am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtosis hp.n hp.mean hp.stdev hp.skew hp.kurtosis
1 0 19 17.14737 3.833966 0.01395038 -0.8031783 19 160.2632 53.90820 -0.01422519 -1.2096973
2 1 13 24.39231 6.166504 0.05256118 -1.4553520 13 126.8462 84.06232 1.35988586 0.5634635
wt.n wt.mean wt.stdev wt.skew wt.kurtosis
1 19 3.768895 0.7774001 0.9759294 0.1415676
2 13 2.411000 0.6169816 0.2103128 -1.1737358
psych
包中的 describeBy()
函数
describeBy(
mtcars[selected_variables],
list(am=mtcars$am)
)
Descriptive statistics by group
am: 0
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 19 17.15 3.83 17.30 17.12 3.11 10.40 24.40 14.00 0.01 -0.80 0.88
hp 2 19 160.26 53.91 175.00 161.06 77.10 62.00 245.00 183.00 -0.01 -1.21 12.37
wt 3 19 3.77 0.78 3.52 3.75 0.45 2.46 5.42 2.96 0.98 0.14 0.18
------------------------------------------------------------------------------
am: 1
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 13 24.39 6.17 22.80 24.38 6.67 15.00 33.90 18.90 0.05 -1.46 1.71
hp 2 13 126.85 84.06 109.00 114.73 63.75 52.00 335.00 283.00 1.36 0.56 23.31
wt 3 13 2.41 0.62 2.32 2.39 0.68 1.51 3.57 2.06 0.21 -1.17 0.17
结果的可视化
直方图
密度图
箱线图
点图
频数表和列联表
使用 vcd
包中的 Arthritis
数据集
library(vcd)
head(Arthritis)
ID Treatment Sex Age Improved
1 57 Treated Male 27 Some
2 46 Treated Male 29 None
3 77 Treated Male 30 None
4 17 Treated Male 32 Marked
5 36 Treated Male 46 Marked
6 23 Treated Male 58 Marked
生成频数表
一维列联表
table()
生成频数统计表
one_d_table <- with(
Arthritis,
table(Improved)
)
one_d_table
Improved
None Some Marked
42 14 28
prop.table()
将频数统计表转为比例值
prop.table(my_table)
Improved
None Some Marked
0.5000000 0.1666667 0.3333333
prop.table(my_table) * 100
Improved
None Some Marked
50.00000 16.66667 33.33333
二维列联表
table(
Arthritis$Treatment, # 行
Arthritis$Improved # 列
)
None Some Marked
Placebo 29 7 7
Treated 13 7 21
xtabs()
函数
my_table <- xtabs(
~ Treatment + Improved,
data=Arthritis
)
my_table
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
margin.table()
生成边际频数
1 表示每行生成一个边际值,即为第 1 个维度生成边际值
margin.table(my_table, 1)
Treatment
Placebo Treated
43 41
margin.table(my_table, 2)
Improved
None Some Marked
42 14 28
prop.table()
生成比例
1 表示沿行生成比例,即沿第一个维度计算比例
prop.table(my_table, 1)
Improved
Treatment None Some Marked
Placebo 0.6744186 0.1627907 0.1627907
Treated 0.3170732 0.1707317 0.5121951
prop.table(my_table, 2)
Improved
Treatment None Some Marked
Placebo 0.6744186 0.1627907 0.1627907
Treated 0.3170732 0.1707317 0.5121951
addmargins()
添加边际和
addmargins(my_table)
Improved
Treatment None Some Marked Sum
Placebo 29 7 7 43
Treated 13 7 21 41
Sum 42 14 28 84
1 表示行,2 表示列。
下面代码中的 1 表示按行求比例,即每行所有数值加和为 1
2 表示添加列方向的累加和,也就是为每行添加一个累加值
addmargins(
prop.table(
my_table,
1
),
2
)
Improved
Treatment None Some Marked Sum
Placebo 0.6744186 0.1627907 0.1627907 1.0000000
Treated 0.3170732 0.1707317 0.5121951 1.0000000
addmargins(
prop.table(
my_table,
2
),
1
)
Improved
Treatment None Some Marked
Placebo 0.6904762 0.5000000 0.2500000
Treated 0.3095238 0.5000000 0.7500000
Sum 1.0000000 1.0000000 1.0000000
gmodels
包中的 CrossTable()
函数
library(gmodels)
CrossTable(
Arthritis$Treatment,
Arthritis$Improved
)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 84
| Arthritis$Improved
Arthritis$Treatment | None | Some | Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
Placebo | 29 | 7 | 7 | 43 |
| 2.616 | 0.004 | 3.752 | |
| 0.674 | 0.163 | 0.163 | 0.512 |
| 0.690 | 0.500 | 0.250 | |
| 0.345 | 0.083 | 0.083 | |
--------------------|-----------|-----------|-----------|-----------|
Treated | 13 | 7 | 21 | 41 |
| 2.744 | 0.004 | 3.935 | |
| 0.317 | 0.171 | 0.512 | 0.488 |
| 0.310 | 0.500 | 0.750 | |
| 0.155 | 0.083 | 0.250 | |
--------------------|-----------|-----------|-----------|-----------|
Column Total | 42 | 14 | 28 | 84 |
| 0.500 | 0.167 | 0.333 | |
--------------------|-----------|-----------|-----------|-----------|
多维列联表
my_table <- xtabs(
~ Treatment + Sex + Improved,
data=Arthritis
)
my_table
, , Improved = None
Sex
Treatment Female Male
Placebo 19 10
Treated 6 7
, , Improved = Some
Sex
Treatment Female Male
Placebo 7 0
Treated 5 2
, , Improved = Marked
Sex
Treatment Female Male
Placebo 6 1
Treated 16 5
ftable()
以一种更紧凑的形式输出多维列联表
ftable(my_table)
Improved None Some Marked
Treatment Sex
Placebo Female 19 7 6
Male 10 0 1
Treated Female 6 5 16
Male 7 2 5
边际频数
margin.table(
my_table,
1
)
Treatment
Placebo Treated
43 41
margin.table(
my_table,
2
)
Sex
Female Male
59 25
margin.table(
my_table,
3
)
Improved
None Some Marked
42 14 28
多维边际频数
margin.table(
my_table,
c(1, 3)
)
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
比例
ftable(
prop.table(
my_table,
c(1, 2)
)
)
Improved None Some Marked
Treatment Sex
Placebo Female 0.59375000 0.21875000 0.18750000
Male 0.90909091 0.00000000 0.09090909
Treated Female 0.22222222 0.18518519 0.59259259
Male 0.50000000 0.14285714 0.35714286
ftable(
addmargins(
prop.table(
my_table,
c(1, 2)
),
3 # 为第三维 (Improved) 增加列加和
)
) * 100
Improved None Some Marked Sum
Treatment Sex
Placebo Female 59.375000 21.875000 18.750000 100.000000
Male 90.909091 0.000000 9.090909 100.000000
Treated Female 22.222222 18.518519 59.259259 100.000000
Male 50.000000 14.285714 35.714286 100.000000
独立性检验
卡方独立检验
卡方检验适用于计数数据,可以检验数据与预期分布的拟合程度。 在统计实践中,卡方统计量的最常见用法是与 r x c 列联表一起使用,以评估对变量间独立性的零假设是否合理。
引自 [1]
chisq.test()
下面示例显示治疗情况和改善情况不独立 (p值太小)
my_table <- xtabs(
~ Treatment + Improved,
data=Arthritis
)
my_table
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
chisq.test(my_table)
Pearson's Chi-squared test
data: my_table
X-squared = 13.055, df = 2, p-value = 0.001463
p 值表示从总体中抽取的样本行变量与列变量是相互独立的概率。
下面示例显示性别和改善情况独立
my_table <- xtabs(
~ Improved + Sex,
data=Arthritis
)
my_table
Sex
Improved Female Male
None 25 17
Some 12 2
Marked 22 6
chisq.test(my_table)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: my_table
X-squared = 4.8407, df = 2, p-value = 0.08889
Fisher 精确检验
可以实际列出所有可能出现的重排 (置换) 情况及其频数,进而确定观测结果的极端程度。 这一操作被称为费舍尔精确检验 (Fisher’s exact test)。
引自 [1]
my_table <- xtabs(
~ Treatment + Improved,
data=Arthritis
)
my_table
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
fisher.test(my_table)
Fisher's Exact Test for Count Data
data: my_table
p-value = 0.001393
alternative hypothesis: two.sided
Cochran-Mantel-Haenszel 卡方检验
假设两个名义变量在第三个变量的每一层中都是条件独立的。
下面代码假设不存在三阶交互作用 (治疗情况 x 改善情况 x 性别)
my_table <- xtabs(
~ Treatment + Improved + Sex,
data=Arthritis
)
ftable(my_table)
Sex Female Male
Treatment Improved
Placebo None 19 10
Some 7 0
Marked 6 1
Treated None 6 7
Some 5 2
Marked 16 5
结果表明,治疗与得到的改善在性别的每一水平下并不独立
mantelhaen.test(my_table)
Cochran-Mantel-Haenszel test
data: my_table
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647
相关性的度量
vcd
包的 assocstats()
函数
计算二维列联表的 phi 系数,列联系数和 Cramer’s V 系数
my_table <- xtabs(
~ Treatment + Improved,
data=Arthritis
)
my_table
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
assocstats(my_table)
X^2 df P(> X^2)
Likelihood Ratio 13.530 2 0.0011536
Pearson 13.055 2 0.0014626
Phi-Coefficient : NA
Contingency Coeff.: 0.367
Cramer's V : 0.394
结果的可视化
条形图
马赛克图
关联图
…
相关
相关系数
head(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
相关的类型
相关系数
Pearson 积差相关系数衡量两个定量变量之间的线性相关程度。
Spearman 等级相关系数衡量分级定序变量之间的相关程度。
Kendall’s Tau 相关系数是一种非参数的等级相关度量。
cov()
计算协方差
cor()
计算相关系数
use
:all.obs
,everything
,complete.obs
,pairwise.complete.obs
method
: pearson
,spearman
,kendall
默认参数 use="everything", method="pearson"
states <- state.x77[, 1:6]
cov(states)
Population Income Illiteracy Life Exp Murder HS Grad
Population 19931683.7588 571229.7796 292.8679592 -407.8424612 5663.523714 -3551.509551
Income 571229.7796 377573.3061 -163.7020408 280.6631837 -521.894286 3076.768980
Illiteracy 292.8680 -163.7020 0.3715306 -0.4815122 1.581776 -3.235469
Life Exp -407.8425 280.6632 -0.4815122 1.8020204 -3.869480 6.312685
Murder 5663.5237 -521.8943 1.5817755 -3.8694804 13.627465 -14.549616
HS Grad -3551.5096 3076.7690 -3.2354694 6.3126849 -14.549616 65.237894
Pearson 积差相关系数
cor(states)
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.00000000 0.2082276 0.1076224 -0.06805195 0.3436428 -0.09848975
Income 0.20822756 1.0000000 -0.4370752 0.34025534 -0.2300776 0.61993232
Illiteracy 0.10762237 -0.4370752 1.0000000 -0.58847793 0.7029752 -0.65718861
Life Exp -0.06805195 0.3402553 -0.5884779 1.00000000 -0.7808458 0.58221620
Murder 0.34364275 -0.2300776 0.7029752 -0.78084575 1.0000000 -0.48797102
HS Grad -0.09848975 0.6199323 -0.6571886 0.58221620 -0.4879710 1.00000000
Spearman 等级相关系数
cor(states, method="spearman")
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.0000000 0.1246098 0.3130496 -0.1040171 0.3457401 -0.3833649
Income 0.1246098 1.0000000 -0.3145948 0.3241050 -0.2174623 0.5104809
Illiteracy 0.3130496 -0.3145948 1.0000000 -0.5553735 0.6723592 -0.6545396
Life Exp -0.1040171 0.3241050 -0.5553735 1.0000000 -0.7802406 0.5239410
Murder 0.3457401 -0.2174623 0.6723592 -0.7802406 1.0000000 -0.4367330
HS Grad -0.3833649 0.5104809 -0.6545396 0.5239410 -0.4367330 1.0000000
非方形的相关矩阵
x <- states[, c(
"Population",
"Income",
"Illiteracy",
"HS Grad"
)]
y <- states[, c(
"Life Exp",
"Murder"
)]
cor(x, y)
Life Exp Murder
Population -0.06805195 0.3436428
Income 0.34025534 -0.2300776
Illiteracy -0.58847793 0.7029752
HS Grad 0.58221620 -0.4879710
偏相关
偏相关是指在控制一个或多个定量变量时,另外两个定量变量之间的相互关系。
ggm
包中的 pcor()
函数
library(ggm)
colnames(states)
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder" "HS Grad"
u
中前两个数值表示计算相关向量的下标,其余数值为条件变量下标
pcor(u, S)
pcor(
c(1, 5, 2, 3, 6),
cov(states)
)
[1] 0.3462724
其他类型的相关
polycor
包中的 hetcor()
计算一种混合的相关矩阵
library(polycor)
hetcor(states)
Two-Step Estimates
Correlations/Type of Correlation:
Population Income Illiteracy Life.Exp Murder HS.Grad
Population 1 Pearson Pearson Pearson Pearson Pearson
Income 0.2082 1 Pearson Pearson Pearson Pearson
Illiteracy 0.1076 -0.4371 1 Pearson Pearson Pearson
Life.Exp -0.06805 0.3403 -0.5885 1 Pearson Pearson
Murder 0.3436 -0.2301 0.703 -0.7808 1 Pearson
HS.Grad -0.09849 0.6199 -0.6572 0.5822 -0.488 1
Standard Errors:
Population Income Illiteracy Life.Exp Murder
Population
Income 0.1357
Illiteracy 0.1401 0.1152
Life.Exp 0.1411 0.1257 0.09337
Murder 0.1253 0.1344 0.07247 0.05604
HS.Grad 0.1404 0.08801 0.08129 0.0944 0.1086
n = 50
P-values for Tests of Bivariate Normality:
Population Income Illiteracy Life.Exp Murder
Population
Income 0.003105
Illiteracy 0.004307 0.02102
Life.Exp 0.01916 0.03275 0.1717
Murder 0.1015 0.174 0.2613 0.07242
HS.Grad 0.007223 0.114 0.005472 0.1134 0.02036
相关性的显著性检验
常用原假设为变量间不相关(即总体的相关系数为0)
cor.test()
alternative
:two.side
,less
,greater
method
:pearson
,kendall
,spearman
cor.test(states[,3], states[,5])
Pearson's product-moment correlation
data: states[, 3] and states[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5279280 0.8207295
sample estimates:
cor
0.7029752
psych
包中 corr.test()
函数
corr.test(
states,
use="complete"
)
Call:corr.test(x = states, use = "complete")
Correlation matrix
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.00 0.21 0.11 -0.07 0.34 -0.10
Income 0.21 1.00 -0.44 0.34 -0.23 0.62
Illiteracy 0.11 -0.44 1.00 -0.59 0.70 -0.66
Life Exp -0.07 0.34 -0.59 1.00 -0.78 0.58
Murder 0.34 -0.23 0.70 -0.78 1.00 -0.49
HS Grad -0.10 0.62 -0.66 0.58 -0.49 1.00
Sample Size
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.)
Population Income Illiteracy Life Exp Murder HS Grad
Population 0.00 0.59 1.00 1.0 0.10 1
Income 0.15 0.00 0.01 0.1 0.54 0
Illiteracy 0.46 0.00 0.00 0.0 0.00 0
Life Exp 0.64 0.02 0.00 0.0 0.00 0
Murder 0.01 0.11 0.00 0.0 0.00 0
HS Grad 0.50 0.00 0.00 0.0 0.00 0
To see confidence intervals of the correlations, print with the short=FALSE option
其他显著性检验
ggm
包的 pcor.test()
函数
psych
包的 r.test()
函数
相关关系的可视化
散点图和散点图矩阵
相关图 (correlogram)
t 检验
本节关注的变量为连续型的组间比较,假设其呈正态分布
library(MASS)
head(UScrime)
M So Ed Po1 Po2 LF M.F Pop NW U1 U2 GDP Ineq Prob Time y
1 151 1 91 58 56 510 950 33 301 108 41 394 261 0.084602 26.2011 791
2 143 0 113 103 95 583 1012 13 102 96 36 557 194 0.029599 25.2999 1635
3 142 1 89 45 44 533 969 18 219 94 33 318 250 0.083401 24.3006 578
4 136 0 121 149 141 577 994 157 80 102 39 673 167 0.015801 29.9012 1969
5 141 0 121 109 101 591 985 18 30 91 20 578 174 0.041399 21.2998 1234
6 121 0 110 118 115 547 964 25 44 84 29 689 126 0.034201 20.9995 682
Prob
:监禁的概率U1
:14-24 岁年龄段城市男性失业率U2
:35-39 岁年龄段城市男性失业率So
:分类变量,是否为南方州
独立样本的 t 检验
t.test(
Prob ~ So,
data=UScrime
)
Welch Two Sample t-test
data: Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1
0.03851265 0.06371269
可以拒绝南方各州和非南方各州拥有相同监禁概率的假设 (p < 0.001)
非独立样本的 t 检验
非独立组设计 dependent groups design
sapply(
UScrime[c("U1", "U2")],
function(x) (c(mean=mean(x), sd=sd(x)))
)
U1 U2
mean 95.46809 33.97872
sd 18.02878 8.44545
with(
UScrime,
t.test(U1, U2, paried=TRUE)
)
Welch Two Sample t-test
data: U1 and U2
t = 21.174, df = 65.261, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
55.69010 67.28862
sample estimates:
mean of x mean of y
95.46809 33.97872
多于两组的情况
方差分析(NOAA):第 9 章
组间差异的非参数检验
两组的比较
两组数据独立,可以使用 Wilcoxon 秩和检验 (Mann-Whitney U 检验) 评估观测是否是从相同的概率分布中抽得的。
with(
UScrime,
by(Prob, So, median)
)
So: 0
[1] 0.038201
------------------------------------------------------------------------
So: 1
[1] 0.055552
wilcox.test(
Prob ~ So,
data=UScrime
)
Wilcoxon rank sum exact test
data: Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0
Wilcoxon 符号秩检验是非独立样本 t 检验的一种非参数替代方法,适用于两组成对数据和无法保证正态性假设的情境。
sapply(
UScrime[c("U1", "U2")],
median
)
U1 U2
92 34
with(
UScrime,
wilcox.test(
U1, U2,
paired=TRUE
)
)
cannot compute exact p-value with ties
Wilcoxon signed rank test with continuity correction
data: U1 and U2
V = 1128, p-value = 2.464e-09
alternative hypothesis: true location shift is not equal to 0
多于两组的比较
单向设计 one-way design
各组独立:Kruskal-Wallis 检验
各组不独立:Friedman 检验
states <- data.frame(
state.region,
state.x77
)
head(states)
state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alabama South 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska West 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona West 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas South 2110 3378 1.9 70.66 10.1 39.9 65 51945
California West 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado West 2541 4884 0.7 72.06 6.8 63.9 166 103766
kruskal.test(
Illiteracy ~ state.region,
data=states
)
Kruskal-Wallis rank sum test
data: Illiteracy by state.region
Kruskal-Wallis chi-squared = 22.672, df = 3, p-value = 4.726e-05
组件差异的可视化
箱线图
核密度图
第 9 章和第 19 章介绍的图形
参考
参考文献
[1] 面向数据科学家的实用统计学
R 语言实战
《图形初阶》
《基本数据管理》
《高级数据管理》
《基本图形》