One of the basic assumptions of the classical linear regression model is that the explanatory variables are mutually independent. When two or more explanatory variables are correlated with one another, the model is said to suffer from multicollinearity.
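Before turning to the real data, a toy simulation (not part of the original dataset; all names below are made up for illustration) shows what multicollinearity does in practice: when two regressors are nearly copies of each other, OLS cannot separate their effects, and the coefficient standard errors blow up.

set.seed(1)
x1 = rnorm(50)
x2 = x1 + rnorm(50, sd = 0.01)  # x2 is almost an exact copy of x1
y  = 1 + 2 * x1 + rnorm(50)
summary(lm(y ~ x1 + x2))        # note the huge standard errors on x1 and x2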

Loading the data

commu = read.table("~/mydata/commu.txt", header = TRUE)
print(commu)
##      t       y     x1      x2     x3    x4    x5
## 1 1991  1.5163 0.5275 11.5823 0.2637 1.879 0.896
## 2 1992  2.2657 0.6367 11.7171 0.2763 2.287 1.070
## 3 1993  3.8245 0.8026 11.8517 0.2814 2.939 1.331
## 4 1994  5.9230 0.9589 11.9850 0.2862 3.923 1.746
## 5 1995  8.7551 1.1334 12.1121 0.2904 4.854 2.236
## 6 1996 12.0875 1.3329 12.2389 0.2937 5.576 2.641
## 7 1997 12.6895 1.4434 12.3626 0.2992 6.053 2.834
## 8 1998 22.6494 1.6628 12.4810 0.3040 6.307 2.972
## 9 1999 31.3238 1.9844 12.5909 0.3089 6.534 3.143

Drawing a scatterplot matrix

library(car)
scatterplotMatrix(commu[,-1]) 

The scatterplot matrix shows that the relationship between y and the explanatory variables is nonlinear; since y grows rapidly over the sample period, we take logs of y and look again.

library(tidyverse)
commu %>%
  mutate(lny = log(y)) %>%
  select(lny,x1:x5) %>%
  scatterplotMatrix()

The scatterplot matrix now shows very clear linear relationships among the variables, a first indication that the model may suffer from multicollinearity.

Fitting a log-linear model

model_commu = lm(log(y) ~ x1 + x2 + x3 + x4 + x5, data=commu)
summary(model_commu)
## 
## Call:
## lm(formula = log(y) ~ x1 + x2 + x3 + x4 + x5, data = commu)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.01017 -0.03369  0.05618 -0.02885  0.02188  0.07676 -0.14731  0.10720 
##        9 
## -0.04200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  24.9366    38.4521   0.649    0.563
## x1            2.1636     1.3523   1.600    0.208
## x2           -3.0346     3.9867  -0.761    0.502
## x3           33.7133    32.9394   1.023    0.381
## x4            1.2889     0.8341   1.545    0.220
## x5           -2.0272     1.6643  -1.218    0.310
## 
## Residual standard error: 0.1246 on 3 degrees of freedom
## Multiple R-squared:  0.9944,	Adjusted R-squared:  0.985 
## F-statistic: 106.3 on 5 and 3 DF,  p-value: 0.001421

None of the individual coefficients in the log-linear model is significant (every t-test p-value exceeds 0.2), yet the goodness of fit is extremely high (R-squared = 0.9944) and the overall F-test is highly significant (p = 0.0014). This combination is a classic symptom of multicollinearity.
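The contrast can be read off the fitted model directly; a quick check using standard summary() accessors:

summary(model_commu)$fstatistic           # overall F-test: highly significant
coef(summary(model_commu))[, "Pr(>|t|)"]  # individual t-test p-values: all large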

Klein's rule

library(corrplot)
commu %>%
  select(-t) %>%
  cor() %>%
  corrplot.mixed(upper = "ellipse")

The correlation plot shows that at least one pair of explanatory variables has a correlation coefficient larger than the model's R-squared, so by Klein's rule the model likely suffers from multicollinearity.
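Klein's rule can also be applied programmatically; a minimal sketch that flags every pair of regressors whose correlation exceeds the model's R-squared (0.9944):

r2 = summary(model_commu)$r.squared
cm = cor(commu[, paste0("x", 1:5)])
# pairs whose correlation exceeds the model's R-squared
which(abs(cm) > r2 & upper.tri(cm), arr.ind = TRUE)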

Condition index method

commu %>%
  select(x1:x5) %>%
  as.matrix() -> m

# Condition index: the square root of the ratio of the largest to the
# smallest eigenvalue of t(X) %*% X
CI = function(m){
  eig = eigen(t(m) %*% m)
  eig_max = max(eig$values)
  eig_min = min(eig$values)
  ci = sqrt(eig_max/eig_min)
  names(ci) = "Condition Index (CI)"
  return(ci)
}

ci_commu = CI(m)
print(ci_commu)
## Condition Index (CI) 
##             7343.246

The condition index CI is 7343.246, far above the usual cutoff of 10 (values above 100 are generally taken to indicate severe multicollinearity), so the model suffers from multicollinearity.
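As a cross-check, base R's kappa() computes the same quantity directly: with exact = TRUE it returns the 2-norm condition number of m, i.e. the square root of the eigenvalue ratio above.

kappa(m, exact = TRUE)  # should agree with CI(m)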

Variance inflation factor (VIF)

library(car)
vif(model_commu)
##        x1        x2        x3        x4        x5 
##  222.1772  987.3654  112.9366 1149.0871 1057.7071

All VIFs are far greater than 10, indicating that the model suffers from multicollinearity.
Note: to be conservative, compare against the cutoff of 10 when trying to establish that multicollinearity exists, and against the stricter cutoff of 5 when trying to establish that it does not.
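For reference, vif() here is equivalent to the textbook auxiliary-regression definition: regress each x_j on the remaining regressors and compute 1/(1 - R_j^2). A minimal sketch:

sapply(paste0("x", 1:5), function(v) {
  others = setdiff(paste0("x", 1:5), v)
  # auxiliary regression of x_j on the other regressors
  r2 = summary(lm(reformulate(others, response = v), data = commu))$r.squared
  1 / (1 - r2)
})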

Overcoming multicollinearity

Ridge regression

library(MASS)
commu %>%
  dplyr::select(-t) %>%
  lm.ridge(log(y) ~ x1 + x2 + x3 + x4 + x5, 
           data = ., lambda = seq(0, 0.1, 0.001)) ->model_commu_ridge
plot(model_commu_ridge)  # ridge trace plot

MASS::select(model_commu_ridge)
## modified HKB estimator is 0.004845564 
## modified L-W estimator is 0.05081956 
## smallest value of GCV  at 0.1

Judging from the ridge trace plot together with the ridge-parameter criteria above (HKB ≈ 0.005, L-W ≈ 0.051, GCV at the grid edge of 0.1), the coefficients stabilize around lambda = 0.06, so we use that value.
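The lm.ridge object also stores the GCV criterion for every lambda on the grid, so the GCV curve can be inspected directly; note that GCV bottoms out at the grid edge (0.1) here, which is one reason to lean on the ridge trace plot as well:

plot(model_commu_ridge$lambda, model_commu_ridge$GCV, type = "l",
     xlab = "lambda", ylab = "GCV")
model_commu_ridge$lambda[which.min(model_commu_ridge$GCV)]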

# Re-fit at the chosen lambda and extract the coefficients
commu %>%
  dplyr::select(-t) %>%
  lm.ridge(log(y) ~ x1 + x2 + x3 + x4 + x5, 
           data = ., lambda = 0.06) %>%
  coef()
##                        x1           x2           x3           x4 
## -11.26746327   0.54466708   0.39363500  24.95018656   0.14196002 
##           x5 
##   0.01325687
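MASS::lm.ridge has no predict() method, but coef() returns the intercept and slopes back on the original scale, so fitted values can be computed by hand; a minimal sketch reusing the fit above:

commu %>%
  dplyr::select(-t) %>%
  lm.ridge(log(y) ~ x1 + x2 + x3 + x4 + x5,
           data = ., lambda = 0.06) %>%
  coef() -> b
# fitted values on the log scale: intercept + X %*% slopes
fit = b[1] + as.matrix(commu[, paste0("x", 1:5)]) %*% b[-1]
cbind(fitted = as.vector(fit), actual = log(commu$y))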

Partial least squares estimation

library(pls)
commu %>%
  dplyr::select(-t) %>%
  plsr(log(y) ~ x1 + x2 + x3 + x4 + x5, ncomp = 2, data = .) -> model_commu_pls
summary(model_commu_pls)
## Data: 	X dimension: 9 5 
## 	Y dimension: 9 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
##         1 comps  2 comps
## X         99.54    99.96
## log(y)    96.95    98.46
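The number of components was fixed at 2 above; with only 9 observations, leave-one-out cross-validation (built into plsr() via its validation argument) is a more defensible way to choose it:

model_commu_pls_cv = plsr(log(y) ~ x1 + x2 + x3 + x4 + x5,
                          data = commu, validation = "LOO")
summary(model_commu_pls_cv)  # reports cross-validated RMSEP per component count

Returning to the 2-component fit, its coefficients are: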
model_commu_pls$coefficients
## , , 1 comps
## 
##         log(y)
## x1 0.113507380
## x2 0.082032632
## x3 0.003336848
## x4 0.418013296
## x5 0.200861317
## 
## , , 2 comps
## 
##        log(y)
## x1 0.94276552
## x2 0.46471809
## x3 0.03017342
## x4 0.18604186
## x5 0.07649677
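Finally, a usage sketch: predictions from the 2-component PLS fit are on the log scale, so exponentiate to return to the original units of y.

exp(predict(model_commu_pls, ncomp = 2))  # fitted y on the original scale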