So I had to create a linear regression in Python, but this dataset has over 800 columns. Is there any way to see which columns are contributing most to the linear regression model? Thank you.
3 Answers
Look at the coefficients for each of the features. Ignore the sign of the coefficient:
- A large absolute value means the feature is heavily contributing.
- A value close to zero means the feature is not contributing much.
- A value of zero means the feature is not contributing at all.

- Is there a way to get a list of, say, the top 20 columns? That's what I'm really asking, I guess. – Matt Mar 05 '21 at 21:56
- Take the coefficients, sort them by absolute value (largest first), then slice the list. – D Hudson Mar 06 '21 at 09:46
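A minimal sketch of what that comment describes, assuming the features live in a pandas DataFrame X with a target y and that you fit a scikit-learn LinearRegression (X, y, and model are placeholder names):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumes X (DataFrame of features) and y (target) already exist.
model = LinearRegression()
model.fit(X, y)

# Pair each column name with the absolute value of its coefficient,
# sort largest first, and keep the top 20.
importance = pd.Series(np.abs(model.coef_), index=X.columns)
print(importance.sort_values(ascending=False).head(20))

Keep in mind that raw coefficient magnitudes are only directly comparable when the features are on a similar scale, so standardizing X before fitting (e.g. with StandardScaler) makes this ranking more meaningful.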
You can measure the correlation between each independent variable and the dependent variable, for example:
corr(X1, Y)
corr(X2, Y)
...
corr(Xn, Y)
and then test the model using only the N most correlated variables.
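A minimal sketch of that ranking, assuming everything sits in a pandas DataFrame df with the target in a column named 'Y' (both names are placeholders):

import pandas as pd

# Assumes df contains the feature columns plus the target column 'Y'.
correlations = df.drop(columns='Y').corrwith(df['Y']).abs()

# Keep the N features most correlated (positively or negatively) with the target.
N = 20
print(correlations.sort_values(ascending=False).head(N))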
There are more sophisticated methods to perform dimensionality reduction:
- PCA (Principal Component Analysis) (https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c)
- Forward Feature Construction
- XGBoost: measure the feature importance of each variable and then select the N most important ones (How to get feature importance in xgboost?); see the sketch below
There are many ways to do this, and each one has its pros and cons.
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/
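For the XGBoost option, a rough sketch using the library's scikit-learn-style API (again assuming X and y are already defined; the top-20 cut-off is arbitrary):

import pandas as pd
from xgboost import XGBRegressor

# Assumes X (DataFrame of features) and y (target) already exist.
model = XGBRegressor()
model.fit(X, y)

# feature_importances_ lines up with the columns of X; keep the largest values.
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(20))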

If you are just looking for variables with a high correlation to the target, I would do something like this:
import pandas as pd

cols = df.columns
for c in cols:
    # Set this threshold to whatever you would like
    if df['Y'].corr(df[c]) > .7:
        print(c, df['Y'].corr(df[c]))
After you have decided what threshold/columns you want, you can append c to a list instead of printing it.
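For example, a small variation of the loop above that collects the matching column names (skipping the target itself and using the absolute correlation, with the 0.7 threshold as a placeholder):

selected = []
for c in df.columns:
    if c != 'Y' and abs(df['Y'].corr(df[c])) > 0.7:
        selected.append(c)
print(selected)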
