
This Minitab blog says not to use regular regression coefficients or p-values to determine variable/feature importance. It says to use standardized regression coefficients or changes in R² instead.
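For the change-in-R² part, I assume it means something like a drop-one-predictor comparison (this is just my sketch, assuming X is a DataFrame of predictors and y is the target):

import statsmodels.api as sm

# R-squared of the full model
full_r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared

# Drop each predictor in turn and see how much R-squared falls
for col in X.columns:
    reduced_r2 = sm.OLS(y, sm.add_constant(X.drop(columns=col))).fit().rsquared
    print(col, round(full_r2 - reduced_r2, 4))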

I wanted to find out how to calculate standardized regression coefficients in Python and found this old SO question with this code answer.

import statsmodels.api as sm
from scipy.stats.mstats import zscore

# z-scoring both y and x makes the fitted coefficients standardized
print(sm.OLS(zscore(y), zscore(x)).fit().summary())

Example 1 - Using my Lasso model

from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.01)
selected_columns = list(X.columns)
lasso_model.fit(X, y)
list(zip(selected_columns, lasso_model.coef_))

[Out]:

[('AGE', -0.00013116073118093452),
 ('DISTANCE', 2.2924058071269675e-05),
 ('TOTDELAY', -0.0002569583237660659),
 ('STD/STA', -0.01334152988447677),
 ('WEBRSVN', -0.020335870566292973),
 ('WEBCI', 0.08571155146491262),
 ('AIRPTCI', 0.0327097907845398)...]

Example 2 - Using the statsmodels example from SO

import statsmodels.api as sm
from scipy.stats.mstats import zscore

sm.OLS(zscore(y), zscore(X)).fit().summary()

[Out]:

               coef   std err         t    P>|t|    [0.025    0.975]
AGE         -0.0297     0.002   -14.843    0.000    -0.034    -0.026
DISTANCE    -0.0005     0.003    -0.181    0.856    -0.006     0.005
TOTDELAY    -0.0391     0.002   -20.945    0.000    -0.043    -0.035
STD/STA     -0.0528     0.002   -22.003    0.000    -0.058    -0.048
WEBRSVN     -0.1155     0.003   -37.834    0.000    -0.121    -0.110
WEBCI        0.2147     0.003    81.803    0.000     0.210     0.220
AIRPTCI      0.0958     0.002    46.530    0.000     0.092     0.100
...
  1. Is this the correct method, or is there a more recent way to do it with a Python library since that 2015 post? Is this done only before model selection, or can it be done iteratively?
  2. The blog says to use the absolute value (since the correlation can be positive or negative) to determine whether a predictor is important. What standardized coefficient range is typically used to select important predictors? None of my coefficients is larger than 0.3, so what does that indicate?

This Stats StackExchange answer says that to select predictors you should make sure the data is normalized before you perform the regression, and then take the absolute value of the coefficients. You can also look at the change in the R-squared value.

Does that mean I only have to normalize my data, then run my Lasso model, and then use those coefficients?

Is this the purpose of StandardScaler()?

from sklearn.preprocessing import StandardScaler

std = StandardScaler()
std.fit(X.values)
X_tr = std.transform(X.values)  # each column now has mean 0 and unit variance
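If standardizing first is the right idea, I assume the whole thing would look roughly like this (just my sketch, reusing the alpha from above):

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Standardize the predictors, then fit the Lasso on the scaled data
std = StandardScaler()
X_tr = std.fit_transform(X.values)

lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_tr, y)

# Rank predictors by absolute (standardized) coefficient
print(sorted(zip(X.columns, lasso_model.coef_), key=lambda t: abs(t[1]), reverse=True))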
Edison
  • It's not clear what you want to achieve. Indeed, in most cases, you need to normalize the variables before training the model. But this is not a prerequisite – padu Jun 24 '22 at 12:08
  • I want to select predictors using standardized regression coef's as per the title. – Edison Jun 24 '22 at 12:12
  • I recommend that you familiarize yourself with the topic of feature selection: https://scikit-learn.org/stable/modules/feature_selection.html – padu Jun 24 '22 at 12:20
  • Thanks. Looks good. I was reading about `Pipeline` earlier. For now I was hoping to get some ideas on how to get standardized coefs using `StandardScaler` or another method like `Normalizer`. Some other SO answers also say not to do any of that before the train/test split, so that's another concern. – Edison Jun 24 '22 at 12:52
  • @padu That page mentions `L1-based feature selection`, which I was doing already with Lasso. The problem is, I was told those coefs are regular regression coefs, not standardized ones, and therefore not reliable for determining variable importance. (I've sketched below what I think the combined pipeline would look like.) – Edison Jun 24 '22 at 12:56
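To check that I understand padu's Pipeline and feature-selection suggestion (and the concern about doing this before the train/test split), this is roughly what I have in mind, just a sketch with placeholder settings:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Split first so the scaler and the Lasso only see the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale inside the pipeline, then keep the predictors whose standardized
# Lasso coefficients survive the L1 penalty
pipe = make_pipeline(StandardScaler(), SelectFromModel(Lasso(alpha=0.01)))
pipe.fit(X_train, y_train)

print(X.columns[pipe[-1].get_support()])  # names of the selected predictors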
