So I had to create a linear regression in Python, but this dataset has over 800 columns. Is there any way to see which columns are contributing most to the linear regression model? Thank you.
3 Answers
Look at the coefficients for each of the features. Ignore the sign of the coefficient:
- A large absolute value means the feature is heavily contributing.
- A value close to zero means the feature is not contributing much.
- A value of zero means the feature is not contributing at all.

- Is there a way to get a list of, say, the top 20 columns? That's what I'm really asking, I guess. – Matt Mar 05 '21 at 21:56
- Take the coefficients, sort them by absolute value (largest first), then slice the list. – D Hudson Mar 06 '21 at 09:46
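A minimal sketch of what that comment describes, assuming the features live in a pandas DataFrame X with a target y and that you fit a scikit-learn LinearRegression (X, y, and model are placeholder names):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumes X (DataFrame of features) and y (target) already exist.
model = LinearRegression()
model.fit(X, y)

# Pair each column name with the absolute value of its coefficient,
# sort largest first, and keep the top 20.
importance = pd.Series(np.abs(model.coef_), index=X.columns)
print(importance.sort_values(ascending=False).head(20))

Keep in mind that raw coefficient magnitudes are only directly comparable when the features are on a similar scale, so standardizing X before fitting (e.g. with StandardScaler) makes this ranking more meaningful.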
You can measure the correlation between each independent variable and the dependent variable, for example:
corr(X1, Y)
corr(X2, Y)
...
corr(Xn, Y)
and then test the model using only the N most correlated variables.
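A minimal sketch of that ranking, assuming everything sits in a pandas DataFrame df with the target in a column named 'Y' (both names are placeholders):

import pandas as pd

# Assumes df contains the feature columns plus the target column 'Y'.
correlations = df.drop(columns='Y').corrwith(df['Y']).abs()

# Keep the N features most correlated (positively or negatively) with the target.
N = 20
print(correlations.sort_values(ascending=False).head(N))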
There are more sophisticated methods to perform dimensionality reduction:
- PCA (Principal Component Analysis) (https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c)
- Forward Feature Construction
- XGBoost: measure the feature importance of each variable and then select the N most important ones (How to get feature importance in xgboost?); see the sketch below
There are many ways to do this, and each one has its pros and cons.
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/
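For the XGBoost option, a rough sketch using the library's scikit-learn-style API (again assuming X and y are already defined; the top-20 cut-off is arbitrary):

import pandas as pd
from xgboost import XGBRegressor

# Assumes X (DataFrame of features) and y (target) already exist.
model = XGBRegressor()
model.fit(X, y)

# feature_importances_ lines up with the columns of X; keep the largest values.
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(20))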

If you are just looking for variables with a high correlation to the target, I would do something like this:
import pandas as pd

cols = df.columns
for c in cols:
    # Set this threshold to whatever you would like
    if df['Y'].corr(df[c]) > .7:
        print(c, df['Y'].corr(df[c]))
After you have decided what threshold/columns you want, you can append c to a list instead of printing it.
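For example, a small variation of the loop above that collects the matching column names (skipping the target itself and using the absolute correlation, with the 0.7 threshold as a placeholder):

selected = []
for c in df.columns:
    if c != 'Y' and abs(df['Y'].corr(df[c])) > 0.7:
        selected.append(c)
print(selected)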
