0

I had to create a linear regression in Python, but the dataset has over 800 columns. Is there any way to see which columns contribute most to the linear regression model? Thank you.

Matt

3 Answers

0

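Look at the coefficient of each feature and ignore its sign. The comparison is only meaningful when the features are on the same scale (for example, standardized before fitting), because a coefficient's size depends on the units of its feature: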

  • A large absolute value means the feature is heavily contributing.
  • A value close to zero means the feature is not contributing much.
  • A value of zero means the feature is not contributing at all.
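
As a rough illustration of the points above, here is a minimal sketch that standardizes the features, fits an ordinary least-squares model with NumPy, and ranks the features by absolute coefficient. The function name `rank_features_by_coefficient`, the column names, and the synthetic data are all hypothetical, not taken from the question. It also assumes every column is numeric and non-constant.

```python
import numpy as np
import pandas as pd


def rank_features_by_coefficient(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Fit OLS on standardized features and rank them by |coefficient|."""
    # Standardize so coefficients are comparable across features.
    # (A constant column would have zero std and produce NaNs here.)
    X_std = (X - X.mean()) / X.std(ddof=0)
    # Prepend an intercept column, then solve the least-squares problem.
    A = np.column_stack([np.ones(len(X_std)), X_std.to_numpy()])
    beta, *_ = np.linalg.lstsq(A, y.to_numpy(), rcond=None)
    # Drop the intercept and label the remaining coefficients by column name.
    coefs = pd.Series(beta[1:], index=X.columns)
    return coefs.abs().sort_values(ascending=False)


# Synthetic data (an assumption for illustration only):
# y depends strongly on "a", weakly on "b", and not at all on "c".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
y = 5.0 * X["a"] + 0.5 * X["b"] + rng.normal(scale=0.1, size=500)

ranking = rank_features_by_coefficient(X, y)
print(ranking)
```

With 800 columns, `ranking.head(20)` would show the 20 largest coefficients. Strongly correlated features can share or trade off coefficient weight, so treat the ranking as a guide rather than a definitive answer.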
D Hudson
0

You can measure the correlation between each independent variable and the dependent variable, for example:

corr(X1, Y)
corr(X2, Y)
.
.
.
corr(Xn, Y)

You can then test the model using only the N most correlated variables.
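
One possible way to sketch this in pandas, assuming the data sits in a DataFrame with numeric columns and a target column named `Y`. The helper name `top_n_correlated` and the synthetic data are hypothetical. Ranking is by absolute correlation, so strongly negative relationships are kept as well.

```python
import numpy as np
import pandas as pd


def top_n_correlated(df: pd.DataFrame, target: str, n: int) -> pd.Series:
    """Return the n features with the largest |Pearson correlation| to target."""
    features = df.drop(columns=[target])
    # Correlation of every feature column with the target column.
    corr = features.corrwith(df[target])
    # Order by absolute value, but return the signed correlations.
    order = corr.abs().sort_values(ascending=False).index[:n]
    return corr.loc[order]


# Synthetic data (an assumption for illustration only):
# Y depends positively on x1 and negatively on x3.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 4)), columns=["x1", "x2", "x3", "x4"])
df["Y"] = 3.0 * df["x1"] - 2.0 * df["x3"] + rng.normal(scale=0.5, size=300)

top2 = top_n_correlated(df, "Y", 2)
print(top2)
```

The index of the returned Series (`top2.index`) can then be used to select columns for a smaller model, for example `df[top2.index]`.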

There are also more sophisticated methods for dimensionality reduction. There are many of them, and each one has pros and cons:

https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/

0

If you are just looking for variables with high correlation, I would do something like this:

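import pandas as pd

# Assumes df is an existing DataFrame whose target column is named 'Y'
threshold = 0.7  # Set this to whatever you would like

for c in df.columns.drop('Y'):
    # Compute the correlation once and reuse it
    r = df['Y'].corr(df[c])
    # Compare the absolute value so strong negative correlations count too
    if abs(r) > threshold:
        print(c, r)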

Once you have decided which threshold and columns you want, you can append c to a list inside the loop to collect the selected columns.

fthomson