R语言实战:主成分分析和因子分析

目录

本文内容来自《R 语言实战》(R in Action, 2nd),有部分修改

主成分分析 (PCA):一种数据降维技巧,将大量相关变量转化为一组很少的不相关变量,这些无关变量被称为主成分。

探索性因子分析 (EDA):用来发现一组变量的潜在结构的一系列方法,通过寻找一组更小的、潜在的或隐藏的结构来解释已观测到的、显式的变量间的关系。

R 中的主成分和因子分析

library(psych)

常见步骤:

  • 数据预处理。输入原始数据矩阵或者相关系数矩阵给 principal()fa() 函数
  • 选择因子模型
  • 判断要选择的主成分/因子数目
  • 选择主成分/因子
  • 旋转主成分/因子
  • 解释结果
  • 计算主成分或因子得分

主成分分析

head(USJudgeRatings)
               CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
AARONSON,L.H.   5.7  7.9  7.7  7.3  7.1  7.4  7.1  7.1  7.1  7.0  8.3  7.8
ALEXANDER,J.M.  6.8  8.9  8.8  8.5  7.8  8.1  8.0  8.0  7.8  7.9  8.5  8.7
ARMENTANO,A.J.  7.2  8.1  7.8  7.8  7.5  7.6  7.5  7.5  7.3  7.4  7.9  7.8
BERDON,R.I.     6.8  8.8  8.5  8.8  8.3  8.5  8.7  8.7  8.4  8.5  8.8  8.7
BRACKEN,J.J.    7.3  6.4  4.3  6.5  6.0  6.2  5.7  5.7  5.1  5.3  5.5  4.8
BURNS,E.B.      6.2  8.8  8.7  8.5  7.9  8.0  8.1  8.0  8.0  8.0  8.6  8.6

判断主成分的个数

Kaiser-Harris 准则

Cattell 碎石检验

平行分析

fa.parallel(
  USJudgeRatings[,-1],
  fa="pc",
  n.iter=100,
  show.legend=TRUE,
  main="Scree plot with parallel analysis"
)

基于观测特征值的碎石检验:线段和 x 符号

根据 100 个随机数据矩阵推到出来的特征值均值:虚线

大于 1 的特征值准则:y = 1 的水平线

以上三种准则都表明选择一个主成分即可保留数据集的大部分信息

提取主成分

USJudgeRatings

提取一个未旋转的主成分

pc <- principal(
  USJudgeRatings[,-1],
  nfractors=1
)
pc
Principal Components Analysis
Call: principal(r = USJudgeRatings[, -1], nfractors = 1)
Standardized loadings (pattern matrix) based upon correlation matrix
      PC1   h2     u2 com
INTG 0.92 0.84 0.1565   1
DMNR 0.91 0.83 0.1663   1
DILG 0.97 0.94 0.0613   1
CFMG 0.96 0.93 0.0720   1
DECI 0.96 0.92 0.0763   1
PREP 0.98 0.97 0.0299   1
FAMI 0.98 0.95 0.0469   1
ORAL 1.00 0.99 0.0091   1
WRIT 0.99 0.98 0.0196   1
PHYS 0.89 0.80 0.2013   1
RTEN 0.99 0.97 0.0275   1

                 PC1
SS loadings    10.13
Proportion Var  0.92

Mean item complexity =  1
Test of the hypothesis that 1 component is sufficient.

The root mean square of the residuals (RMSR) is  0.04 
 with the empirical chi square  6.21  with prob <  1 

Fit based upon off diagonal values = 1

PC1:成分载荷,观测变量与主成分的相关系数。PC1 表示第一个主成分

h2:公因子方差,主成分对每个变量的方差解释度

u2:成分唯一性,方差无法被主成分解释的比例,1 - h2

SS loadings:与主成分相关的特征值

Proportion Var:每个主成分对整个数据集的解释程度

Harman23.cor

数据集由变量的相关系数组成

head(Harman23.cor)
$cov
               height arm.span forearm lower.leg weight bitro.diameter
height          1.000    0.846   0.805     0.859  0.473          0.398
arm.span        0.846    1.000   0.881     0.826  0.376          0.326
forearm         0.805    0.881   1.000     0.801  0.380          0.319
lower.leg       0.859    0.826   0.801     1.000  0.436          0.329
weight          0.473    0.376   0.380     0.436  1.000          0.762
bitro.diameter  0.398    0.326   0.319     0.329  0.762          1.000
chest.girth     0.301    0.277   0.237     0.327  0.730          0.583
chest.width     0.382    0.415   0.345     0.365  0.629          0.577
               chest.girth chest.width
height               0.301       0.382
arm.span             0.277       0.415
forearm              0.237       0.345
lower.leg            0.327       0.365
weight               0.730       0.629
bitro.diameter       0.583       0.577
chest.girth          1.000       0.539
chest.width          0.539       1.000

$center
[1] 0 0 0 0 0 0 0 0

$n.obs
[1] 305
fa.parallel(
  Harman23.cor$cov,
  n.obs=302,
  fa="pc",
  n.iter=100,
  show.legend=TRUE,
  main="Scree plot with parallel analysis"
)

三个准则都建议选择两个主成分

pc <- principal(
  Harman23.cor$cov,
  nfactors=2,
  rotate="none"
)
pc
Principal Components Analysis
Call: principal(r = Harman23.cor$cov, nfactors = 2, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
                PC1   PC2   h2    u2 com
height         0.86 -0.37 0.88 0.123 1.4
arm.span       0.84 -0.44 0.90 0.097 1.5
forearm        0.81 -0.46 0.87 0.128 1.6
lower.leg      0.84 -0.40 0.86 0.139 1.4
weight         0.76  0.52 0.85 0.150 1.8
bitro.diameter 0.67  0.53 0.74 0.261 1.9
chest.girth    0.62  0.58 0.72 0.283 2.0
chest.width    0.67  0.42 0.62 0.375 1.7

                       PC1  PC2
SS loadings           4.67 1.77
Proportion Var        0.58 0.22
Cumulative Var        0.58 0.81
Proportion Explained  0.73 0.27
Cumulative Proportion 0.73 1.00

Mean item complexity =  1.7
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.05 

Fit based upon off diagonal values = 0.99

第一主成分解释 58% 方差,第二主成分解释 22% 方差,两者一共解释 81% 方差

主成分旋转

旋转是将成分载荷阵变得更容易解释的一系列数学方法,尽可能对成分去噪。 方法有两种:

  • 正交旋转:使选择的成分保持不相关
  • 斜交旋转:让它们变得相关

最流行的是方差极大旋转,是每个成分只由一组有限的变量来解释

rc <- principal(
  Harman23.cor$cov,
  nfactors=2,
  rotate="varimax"
)
rc
Principal Components Analysis
Call: principal(r = Harman23.cor$cov, nfactors = 2, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
                RC1  RC2   h2    u2 com
height         0.90 0.25 0.88 0.123 1.2
arm.span       0.93 0.19 0.90 0.097 1.1
forearm        0.92 0.16 0.87 0.128 1.1
lower.leg      0.90 0.22 0.86 0.139 1.1
weight         0.26 0.88 0.85 0.150 1.2
bitro.diameter 0.19 0.84 0.74 0.261 1.1
chest.girth    0.11 0.84 0.72 0.283 1.0
chest.width    0.26 0.75 0.62 0.375 1.2

                       RC1  RC2
SS loadings           3.52 2.92
Proportion Var        0.44 0.37
Cumulative Var        0.44 0.81
Proportion Explained  0.55 0.45
Cumulative Proportion 0.55 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.05 

Fit based upon off diagonal values = 0.99

第一个主成分主要由前四个变量解释 (长度变量),第二个主成分主要由后四个变量解释 (容量变量)

获得主成分得分

USJudgeRatings

获得每个观测在某个主成分上的得分

pc <- principal(
  USJudgeRatings[,-1],
  nfactors=1,
  score=TRUE
)
head(pc$scores)
                      PC1
AARONSON,L.H.  -0.1857981
ALEXANDER,J.M.  0.7469865
ARMENTANO,A.J.  0.0704772
BERDON,R.I.     1.1358765
BRACKEN,J.J.   -2.1586211
BURNS,E.B.      0.7669406

律师与法官的接触频数与法官评分间的相关系数 (毫无关联)

cov(USJudgeRatings$CONT, pc$scores)
              PC1
[1,] -0.008294671

Harman23.cor

使用相关系数矩阵时,仅能计算得分系数

rc <- principal(
  Harman23.cor$cov,
  nfactors=2,
  rotate="varimax"
)
round(rc$weights, 2)
                 RC1   RC2
height          0.28 -0.05
arm.span        0.30 -0.08
forearm         0.30 -0.09
lower.leg       0.28 -0.06
weight         -0.06  0.33
bitro.diameter -0.08  0.32
chest.girth    -0.10  0.34
chest.width    -0.04  0.27

探索性因子分析

EFA 目标是通过发掘隐藏在数据下的一组较少的、更为基本的无法观测的变量,来解释一组可供观测变量的相关性。 这些虚拟的、无法观测的变量称作因子。

$$ X_i = a_1F_1 + a_2F_2 + … + a_pF_p + U_i $$

其中 X_i 是第 i 个可观测变量,F_j 是公共因子,p < k,U_i 是 X_i 变量独有的部分,无法被公共因子解释

ability.cov 数据集提供协方差矩阵

ability.cov
$cov
        general picture  blocks   maze reading   vocab
general  24.641   5.991  33.520  6.023  20.755  29.701
picture   5.991   6.700  18.137  1.782   4.936   7.204
blocks   33.520  18.137 149.831 19.424  31.430  50.753
maze      6.023   1.782  19.424 12.711   4.757   9.075
reading  20.755   4.936  31.430  4.757  52.604  66.762
vocab    29.701   7.204  50.753  9.075  66.762 135.292

$center
[1] 0 0 0 0 0 0

$n.obs
[1] 112

使用 cov2cor() 函数将协方差矩阵转为相关系数矩阵

covariances <- ability.cov$cov
correlations <- cov2cor(covariances)
correlations
          general   picture    blocks      maze   reading     vocab
general 1.0000000 0.4662649 0.5516632 0.3403250 0.5764799 0.5144058
picture 0.4662649 1.0000000 0.5724364 0.1930992 0.2629229 0.2392766
blocks  0.5516632 0.5724364 1.0000000 0.4450901 0.3540252 0.3564715
maze    0.3403250 0.1930992 0.4450901 1.0000000 0.1839645 0.2188370
reading 0.5764799 0.2629229 0.3540252 0.1839645 1.0000000 0.7913779
vocab   0.5144058 0.2392766 0.3564715 0.2188370 0.7913779 1.0000000

判断需提取的公因子数

fa.parallel(
  correlations,
  n.obs=221,
  fa="both",
  n.iter=100,
  main="Scree plots with parallel analysis"
)

提取公因子

提取两个因子,使用主轴迭代法 (fm="pa") 提取未旋转因子

fa.pa <- fa(
  correlations,
  nfactors=2,
  rotate="none",
  fm="pa"
)
fa.pa
Factor Analysis using method =  pa
Call: fa(r = correlations, nfactors = 2, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         PA1   PA2   h2    u2 com
general 0.75  0.07 0.57 0.432 1.0
picture 0.52  0.32 0.38 0.623 1.7
blocks  0.75  0.52 0.83 0.166 1.8
maze    0.39  0.22 0.20 0.798 1.6
reading 0.81 -0.51 0.91 0.089 1.7
vocab   0.73 -0.39 0.69 0.313 1.5

                       PA1  PA2
SS loadings           2.75 0.83
Proportion Var        0.46 0.14
Cumulative Var        0.46 0.60
Proportion Explained  0.77 0.23
Cumulative Proportion 0.77 1.00

Mean item complexity =  1.5
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.48
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.96 0.92
Multiple R square of scores with factors          0.93 0.84
Minimum correlation of possible factor scores     0.86 0.68

两个因子解释 60% 方差

因子旋转

正交旋转

fa.varimax <- fa(
  correlations,
  nfactors=2,
  rotate="varimax",
  fm="pa"
)
fa.varimax
Factor Analysis using method =  pa
Call: fa(r = correlations, nfactors = 2, rotate = "varimax", 
    fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         PA1  PA2   h2    u2 com
general 0.49 0.57 0.57 0.432 2.0
picture 0.16 0.59 0.38 0.623 1.1
blocks  0.18 0.89 0.83 0.166 1.1
maze    0.13 0.43 0.20 0.798 1.2
reading 0.93 0.20 0.91 0.089 1.1
vocab   0.80 0.23 0.69 0.313 1.2

                       PA1  PA2
SS loadings           1.83 1.75
Proportion Var        0.30 0.29
Cumulative Var        0.30 0.60
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00

Mean item complexity =  1.3
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.48
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.96 0.92
Multiple R square of scores with factors          0.91 0.85
Minimum correlation of possible factor scores     0.82 0.71

斜交转轴法,比如 promax

fa.promax <- fa(
  correlations,
  nfactors=2,
  rotate="promax",
  fm="pa"
)
fa.promax
Factor Analysis using method =  pa
Call: fa(r = correlations, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
          PA1   PA2   h2    u2 com
general  0.37  0.48 0.57 0.432 1.9
picture -0.03  0.63 0.38 0.623 1.0
blocks  -0.10  0.97 0.83 0.166 1.0
maze     0.00  0.45 0.20 0.798 1.0
reading  1.00 -0.09 0.91 0.089 1.0
vocab    0.84 -0.01 0.69 0.313 1.0

                       PA1  PA2
SS loadings           1.83 1.75
Proportion Var        0.30 0.29
Cumulative Var        0.30 0.60
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00

 With factor correlations of 
     PA1  PA2
PA1 1.00 0.55
PA2 0.55 1.00

Mean item complexity =  1.2
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  15  and the objective function was  2.48
The degrees of freedom for the model are 4  and the objective function was  0.07 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.06 

Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.97 0.94
Multiple R square of scores with factors          0.93 0.88
Minimum correlation of possible factor scores     0.86 0.77

因子结构矩阵:变量与因子的相关系数

因子模式矩阵:标准化的回归系数矩阵,Standardized loadings

因子关联矩阵:因子相关系数矩阵,factor correlations

正交旋转仅考虑第一项,斜交旋转会考虑全部三项

计算因子结构矩阵 (又称因子载荷阵):F = P * Phi

  • F:因子载荷阵
  • P:因子模式矩阵
  • Phi:因子关联矩阵
fsm <- function(oblique) {
  if (class(oblique)[2] == "fa" & is.null(oblique$Phi)) {
    warning("Object doesn't look like oblique EFA")
  } else {
    P <- unclass(oblique$loading)
    F <- P %*% oblique$Phi
    colnames(F) <- c("PA1", "PA2")
    return (F)
  }
}
fsm(fa.promax)
              PA1       PA2
general 0.6362443 0.6879280
picture 0.3203301 0.6137902
blocks  0.4293252 0.9089946
maze    0.2491789 0.4496446
reading 0.9510795 0.4587468
vocab   0.8287066 0.4476297

绘图

factor.plot(
  fa.promax,
  labels=rownames(fa.promax$loadings)
)

fa.diagram(
  fa.promax,
  simple=FALSE
)

Harman74.cor 数据集

fa.24tests <- fa(
  Harman74.cor$cov,
  nfactors=4,
  rotate="promax"
)
fa.24tests
Factor Analysis using method =  minres
Call: fa(r = Harman74.cor$cov, nfactors = 4, rotate = "promax")
Standardized loadings (pattern matrix) based upon correlation matrix
                         MR1   MR3   MR2   MR4   h2   u2 com
VisualPerception       -0.08  0.78  0.06 -0.05 0.55 0.45 1.0
Cubes                  -0.02  0.53 -0.02 -0.05 0.23 0.77 1.0
PaperFormBoard          0.00  0.66 -0.16 -0.01 0.34 0.66 1.1
Flags                   0.10  0.60 -0.03 -0.10 0.35 0.65 1.1
GeneralInformation      0.79 -0.02  0.10 -0.05 0.64 0.36 1.0
PargraphComprehension   0.82  0.00 -0.11  0.09 0.68 0.32 1.1
SentenceCompletion      0.91 -0.02  0.03 -0.14 0.73 0.27 1.0
WordClassification      0.54  0.22  0.11 -0.06 0.51 0.49 1.4
WordMeaning             0.89 -0.02 -0.13  0.08 0.74 0.26 1.1
Addition                0.09 -0.39  0.97 -0.01 0.74 0.26 1.3
Code                    0.03 -0.11  0.53  0.29 0.47 0.53 1.7
CountingDots           -0.15  0.09  0.81 -0.11 0.55 0.45 1.1
StraightCurvedCapitals  0.00  0.38  0.55 -0.17 0.51 0.49 2.0
WordRecognition         0.11 -0.15 -0.07  0.65 0.36 0.64 1.2
NumberRecognition      -0.01 -0.03 -0.07  0.61 0.31 0.69 1.0
FigureRecognition      -0.15  0.39 -0.14  0.54 0.45 0.55 2.2
ObjectNumber            0.00 -0.15  0.11  0.66 0.41 0.59 1.2
NumberFigure           -0.21  0.21  0.25  0.42 0.41 0.59 2.8
FigureWord              0.01  0.16  0.07  0.32 0.23 0.77 1.6
Deduction               0.27  0.35 -0.07  0.18 0.42 0.58 2.5
NumericalPuzzles        0.00  0.34  0.38  0.03 0.42 0.58 2.0
ProblemReasoning        0.26  0.33 -0.03  0.17 0.40 0.60 2.5
SeriesCompletion        0.23  0.48  0.09  0.04 0.51 0.49 1.5
ArithmeticProblems      0.27 -0.03  0.45  0.14 0.49 0.51 1.9

                       MR1  MR3  MR2  MR4
SS loadings           3.70 2.95 2.72 2.11
Proportion Var        0.15 0.12 0.11 0.09
Cumulative Var        0.15 0.28 0.39 0.48
Proportion Explained  0.32 0.26 0.24 0.18
Cumulative Proportion 0.32 0.58 0.82 1.00

 With factor correlations of 
     MR1  MR3  MR2  MR4
MR1 1.00 0.59 0.47 0.53
MR3 0.59 1.00 0.53 0.59
MR2 0.47 0.53 1.00 0.56
MR4 0.53 0.59 0.56 1.00

Mean item complexity =  1.5
Test of the hypothesis that 4 factors are sufficient.

The degrees of freedom for the null model are  276  and the objective function was  11.44
The degrees of freedom for the model are 186  and the objective function was  1.72 

The root mean square of the residuals (RMSR) is  0.04 
The df corrected root mean square of the residuals is  0.05 

Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   MR1  MR3  MR2  MR4
Correlation of (regression) scores with factors   0.96 0.93 0.94 0.90
Multiple R square of scores with factors          0.92 0.86 0.89 0.81
Minimum correlation of possible factor scores     0.85 0.72 0.77 0.61

因子得分

fa.promax$weights
               PA1        PA2
general 0.07836571 0.21091463
picture 0.02000694 0.09038071
blocks  0.03662858 0.70209400
maze    0.02700266 0.03453381
reading 0.74276722 0.03012597
vocab   0.17683183 0.03583324

其他与 EFA 相关的包

  • FactoMineR
  • FAiR
  • GPArotation
  • nFactors

其他潜变量模型

结构方程模型 (SEM),验证性因子分析 (CFA)

潜类别模型

简单和多重对应分析

多维标度法 (MDS)

参考

https://github.com/perillaroc/r-in-action-study

R 语言实战

图形初阶

基本数据管理

高级数据管理

基本图形

基本统计分析

回归

方差分析

中级绘图

重抽样与自助法

广义线性模型