To detect the cause of multicollinearity, you can simply check the correlation matrix (the first two lines in behzad.nouri's answer) to see which variables are highly correlated with each other (look for values close to 1).
Another alternative is to look at variance inflation factors (VIFs). The statsmodels package reports VIF values as well. There is no standard threshold, but VIF values greater than 4 are commonly considered problematic.
import numpy as np
import statsmodels.stats.outliers_influence as oi

# Three correlated random variables; the covariance between x and y is high.
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
x, y, z = np.random.multivariate_normal(mean, cov, 1000).T
print(np.corrcoef([x, y, z]))
In the above code I've created three random variables x, y, and z. The covariance between x and y is high, so if you print out the correlation matrix you will see that the correlation between these two variables is very high as well (0.931).
array([[1.        , 0.93109838, 0.1051695 ],
       [0.93109838, 1.        , 0.18838079],
       [0.1051695 , 0.18838079, 1.        ]])
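If you have more than a handful of variables, you can scan the matrix programmatically instead of eyeballing it. A minimal sketch, using an arbitrary 0.9 cut-off (the threshold is a judgment call, not a fixed rule):

corr = np.corrcoef([x, y, z])
# Walk the upper triangle only, skipping the diagonal, so each pair is checked once.
rows, cols = np.triu_indices_from(corr, k=1)
for i, j in zip(rows, cols):
    if abs(corr[i, j]) > 0.9:
        print(i, j, corr[i, j])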
At this point you can discard either x or y, as the correlation between them is very high and using only one of them would be enough to explain most of the variation.
You can check the VIF values as well:
# Stack the predictors as columns: one row per observation, one column per variable.
exog = np.array([x, y, z]).transpose()
vif0 = oi.variance_inflation_factor(exog, 0)
If you print out vif0 it will give you 7.21 for the first variable, which is a high number and indicative of strong multicollinearity between the first variable and the other variables.
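If you want a VIF for every variable at once, a small loop over the columns of exog works; this is just a sketch built on the arrays above:

# Compute one VIF per column of exog.
vifs = [oi.variance_inflation_factor(exog, i) for i in range(exog.shape[1])]
print(vifs)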
Which one to exclude from the analysis (x or y) is up to you. You can check their standardized regression coefficients to see which one has a higher impact. You can also use techniques like ridge regression or the lasso if you have a multicollinearity problem. If you want to go deeper, I would suggest asking on Cross Validated instead.
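As a rough illustration of the ridge option (not part of the example above): the response variable target below is made up purely for demonstration, and alpha is an arbitrary penalty strength you would normally tune.

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical response, constructed only to have something to regress on.
rng = np.random.default_rng(0)
target = 2 * x + 0.5 * z + rng.normal(size=x.shape[0])

# Standardize the predictors so the L2 penalty treats them on the same scale.
exog_std = StandardScaler().fit_transform(exog)

# The penalty shrinks the coefficients of correlated predictors instead of
# forcing you to drop one of them.
model = Ridge(alpha=1.0).fit(exog_std, target)
print(model.coef_)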