
Can scikit-learn be used to remove highly correlated features when doing multiple linear regression?

With regard to the answer posted by @behzad.nouri to Capturing high multi-collinearity in statsmodels, I have some questions to clear up my confusion.

He tested for high multicollinearity among 5 columns (features) of independent variables, each with 100 rows of data, and found that w[0] is close to zero. Can I conclude that the first column (the first independent variable) should be removed to avoid very high multicollinearity?

  • Please edit your question title into something useful. How could somebody possibly find this by searching with the title as it is? This doesn't really seem to be a programming question, either; it seems to be a statistical question and might be better asked somewhere else. – talonmies Nov 08 '15 at 13:57

1 Answer

For detecting the cause of multicollinearity, you can simply check the correlation matrix (the first two lines in behzad.nouri's answer) to see which variables are highly correlated with each other (look for values close to 1).

An alternative is to look at variance inflation factors (VIFs). The statsmodels package can compute VIF values as well. There is no standard threshold, but VIF values greater than 4 are commonly considered problematic.

import numpy as np
import statsmodels.stats.outliers_influence as oi

# three correlated random variables: x and y share most of their variance
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
x, y, z = np.random.multivariate_normal(mean, cov, 1000).T
print(np.corrcoef([x, y, z]))

In the above code I've created three random variables x, y, and z. The covariance between x and y is high, so if you print out the correlation matrix you will see that the correlation between these two variables is very high as well (0.931).

array([[ 1.        ,  0.93109838,  0.1051695 ],
       [ 0.93109838,  1.        ,  0.18838079],
       [ 0.1051695 ,  0.18838079,  1.        ]])

At this point you can discard either x or y, since the correlation between them is very high and using only one of them is enough to explain most of the variation.
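To make that inspection programmatic, here is a minimal sketch that regenerates the same toy data and flags every pair of columns whose absolute correlation exceeds a cutoff (the 0.8 cutoff is an illustrative assumption, not a rule from the original answer). It also shows how to pull out a single coefficient, e.g. between the first and second variables:

import numpy as np

# same toy data as above: x and y are strongly correlated, z mostly is not
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
x, y, z = np.random.multivariate_normal(mean, cov, 1000).T

data = np.column_stack([x, y, z])
corr = np.corrcoef(data, rowvar=False)  # 3x3 correlation matrix of the columns
print(corr[0, 1])                       # correlation between the 1st and 2nd variable

threshold = 0.8  # illustrative cutoff, not a universal rule
n = corr.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if abs(corr[i, j]) > threshold:
            print(f"columns {i} and {j} are highly correlated (r = {corr[i, j]:.3f})")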

You can check the VIF values as well:

exog = np.array([x, y, z]).transpose()        # design matrix with variables as columns
vif0 = oi.variance_inflation_factor(exog, 0)  # VIF of the first column (x)

If you print out vif0 it will give you 7.21 for the first variable, which is high and indicates strong multicollinearity between the first variable and the others.
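If you want a VIF for every column at once, a small loop over the column index works. This is a sketch on the same toy data; the index passed as the second argument selects the column (0 for the first variable, 1 for the second, and so on):

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# rebuild the same toy design matrix as above
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
x, y, z = np.random.multivariate_normal(mean, cov, 1000).T
exog = np.column_stack([x, y, z])

# one VIF per column: x and y should both come out high, z much lower
vifs = [variance_inflation_factor(exog, i) for i in range(exog.shape[1])]
print(vifs)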

Which one to exclude from the analysis (x or y) is up to you. You can check their standardized regression coefficients to see which one has a higher impact. You can also use techniques like ridge regression or lasso if you have a multicollinearity problem. If you want to go deeper, I would suggest asking on Cross Validated instead.
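Since the question mentions scikit-learn, here is a hedged sketch of fitting ridge and lasso on the same correlated toy features. The target y_target and the alpha values are made up purely for illustration:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# same correlated toy features as above
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean, cov, 1000)

# hypothetical target: a linear combination of the features plus noise
y_target = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=5, size=1000)

ridge = Ridge(alpha=1.0).fit(X, y_target)  # shrinks correlated coefficients
lasso = Lasso(alpha=0.1).fit(X, y_target)  # can zero out redundant features
print(ridge.coef_)
print(lasso.coef_)

Both estimators keep all columns but penalize the coefficients, which is often a reasonable alternative to dropping a variable outright.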

  • ok. So can you show me an example with code for extracting the variables that cause multicollinearity using the correlation matrix or the VIF approach? – Roman Nov 08 '15 at 14:24
  • ok, upvoted! So vif0 is for the 1st variable? Does oi.variance_inflation_factor(exog, 1) give the 2nd variable? – Roman Nov 08 '15 at 15:38
  • That's good! For corrcoef, how can I extract the coefficient between variables 1 and 2? – Roman Nov 08 '15 at 15:40
  • One last question: can you help me do it without patsy and pandas? df = pd.DataFrame([x,y,z]).T; exog = dmatrix(df) – Roman Nov 08 '15 at 15:47
  • I mean just for `df = pd.DataFrame([x,y,z]).T; exog = dmatrix(df)` – Roman Nov 08 '15 at 15:54