My question is about unnecessary predictors, namely variables that provide no new linear information because they are linear combinations of the other predictors. As you can see, the swiss dataset has six variables.
data(swiss)  # swiss is a built-in dataset (package datasets), not a package itself
names(swiss)
# "Fertility" "Agriculture" "Examination" "Education"
# "Catholic" "Infant.Mortality"
Now I introduce a new variable, ec. It is the sum of Examination and Catholic.
ec <- swiss$Examination + swiss$Catholic
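One way to see that ec adds no new information is to check the rank of the design matrix directly; a quick sketch (the column selection below assumes Fertility is the first column of swiss, as in the built-in dataset):

```r
data(swiss)
ec <- swiss$Examination + swiss$Catholic

# Design matrix: intercept + the five predictors + ec (7 columns in total)
X <- cbind(Intercept = 1, as.matrix(swiss[, -1]), ec = ec)

ncol(X)     # 7 columns...
qr(X)$rank  # ...but rank 6: ec is exactly Examination + Catholic
```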
When we run a linear regression with unnecessary variables, R drops terms that are linear combinations of other terms and returns NA
as their coefficients. The command below illustrates the point perfectly.
lm(Fertility ~ . + ec, swiss)
Coefficients:
(Intercept) Agriculture Examination Education
66.9152 -0.1721 -0.2580 -0.8709
Catholic Infant.Mortality ec
0.1041 1.0770 NA
However, when we regress first on ec and then on all of the regressors, as shown below,
lm(Fertility ~ ec + ., swiss)
Coefficients:
(Intercept) ec Agriculture Examination
66.9152 0.1041 -0.1721 -0.3621
Education Catholic Infant.Mortality
-0.8709 NA 1.0770
I would expect the coefficients of both Catholic and Examination to be NA, since ec is a linear combination of both of them. But in the end, the coefficient of Examination is not NA, whereas that of Catholic is.
Could anyone explain the reason for this?
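To make the comparison concrete, here is a self-contained sketch of the two fits; alias() reports exactly which term lm treated as redundant in each ordering (using data() rather than library(), since swiss ships with R's datasets package):

```r
data(swiss)
ec <- swiss$Examination + swiss$Catholic

fit1 <- lm(Fertility ~ . + ec, data = swiss)  # ec listed last: ec is aliased
fit2 <- lm(Fertility ~ ec + ., data = swiss)  # ec listed first: Catholic is aliased

is.na(coef(fit1)["ec"])        # TRUE
is.na(coef(fit2)["Catholic"])  # TRUE

# alias() shows the exact linear dependency behind each dropped term
alias(fit1)$Complete
alias(fit2)$Complete
```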