This Minitab blog post says not to use regular regression coefficients or p-values to determine variable/feature importance. It recommends using standardized regression coefficients or the change in R-squared instead.
I wanted to find out how to calculate standardized regression coefficients in Python and found this old SO question, which has the following code in an answer:
import statsmodels.api as sm
from scipy.stats.mstats import zscore
print(sm.OLS(zscore(y), zscore(x)).fit().summary())
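For context, here is a minimal sketch of what I understand that snippet to be doing. It uses only NumPy and synthetic data (the arrays, the seed, and the local zscore helper are my own assumptions, not taken from the SO answer). Regressing z-scored y on z-scored X should give the same numbers as fitting on the raw data and rescaling each slope by sd(x) / sd(y).

```python
import numpy as np

# Synthetic data, purely for illustration (not my real dataset).
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.normal(50, 10, n), rng.normal(1000, 300, n)])
y = 0.02 * X[:, 0] - 0.001 * X[:, 1] + rng.normal(0, 1, n)


def zscore(a):
    """Center to mean 0 and scale to standard deviation 1, column-wise."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


# Route 1: regress z-scored y on z-scored X.
# No intercept is needed because every variable is already centered.
beta_std, *_ = np.linalg.lstsq(zscore(X), zscore(y), rcond=None)

# Route 2: fit on the raw data with an intercept, then rescale each slope.
X_with_const = np.column_stack([np.ones(n), X])
b_raw, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)
beta_rescaled = b_raw[1:] * X.std(axis=0) / y.std()

print(beta_std)
print(beta_rescaled)
```

If this is right, a standardized coefficient reads as "standard deviations of y per one standard deviation of the predictor", which is what makes predictors measured in different units comparable.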
Example 1 - Using my lasso model
from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha=0.01)
selected_columns = list(X.columns)
lasso_model.fit(X, y)
list(zip(selected_columns, lasso_model.coef_))
[Out]:
[('AGE', -0.00013116073118093452),
('DISTANCE', 2.2924058071269675e-05),
('TOTDELAY', -0.0002569583237660659),
('STD/STA', -0.01334152988447677),
('WEBRSVN', -0.020335870566292973),
('WEBCI', 0.08571155146491262),
('AIRPTCI', 0.0327097907845398)...]
Example 2 - Using the statsmodels example from SO
import statsmodels.api as sm
from scipy.stats.mstats import zscore
sm.OLS(zscore(y), zscore(X)).fit().summary()
[Out]:
               coef    std err          t      P>|t|     [0.025     0.975]
AGE         -0.0297      0.002    -14.843      0.000     -0.034     -0.026
DISTANCE    -0.0005      0.003     -0.181      0.856     -0.006      0.005
TOTDELAY    -0.0391      0.002    -20.945      0.000     -0.043     -0.035
STD/STA     -0.0528      0.002    -22.003      0.000     -0.058     -0.048
WEBRSVN     -0.1155      0.003    -37.834      0.000     -0.121     -0.110
WEBCI        0.2147      0.003     81.803      0.000      0.210      0.220
AIRPTCI      0.0958      0.002     46.530      0.000      0.092      0.100
...
- Is this the correct method, or is there a more recent way to do it with a Python library, given that the post is from 2015? Is this done only once, before model selection, or can it be done iteratively?
- The blog says to use the absolute value of each standardized coefficient (since a predictor's effect can be positive or negative) to judge whether a predictor is important. Is there a standard range of coefficient values used to select important predictors? None of my coefficients is larger than 0.3 in absolute value, so what does that indicate?
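The blog's other suggestion, the change in R-squared, could be sketched as follows, as far as I understand it: fit the full model, refit with each predictor left out in turn, and record how much the in-sample R-squared drops. The data and the feature names A, B, C are synthetic placeholders, and using scikit-learn's LinearRegression here is my own choice rather than anything the blog prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, purely for illustration; the names are placeholders.
rng = np.random.default_rng(1)
n = 400
names = ["A", "B", "C"]
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # C has no real effect

# In-sample R-squared of the model containing every predictor.
full_r2 = LinearRegression().fit(X, y).score(X, y)

# Drop each predictor in turn and record how much R-squared falls.
r2_drop = {}
for j, name in enumerate(names):
    X_reduced = np.delete(X, j, axis=1)
    reduced_r2 = LinearRegression().fit(X_reduced, y).score(X_reduced, y)
    r2_drop[name] = full_r2 - reduced_r2

# Largest drop first: the predictor the full model relies on most.
for name, delta in sorted(r2_drop.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:.4f}")
```

One thing I notice is that this measure does not depend on the units of the predictors, so no scaling is needed for it.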
This Stats Stack Exchange answer says that, to select predictors, you should normalize the data before performing the regression and then take the absolute value of the coefficients. It adds that you can also look at the change in the R-squared value.
Does that mean I only have to normalize my data, run my lasso model, and then use those coefficients?
Is this the purpose of StandardScaler()?
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
std.fit(X.values)
X_tr = std.transform(X.values)
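To make the question concrete, this is roughly what I have in mind: a minimal sketch that chains StandardScaler and Lasso in a scikit-learn pipeline and ranks features by the absolute value of the fitted coefficients. The data and feature names are synthetic placeholders, and alpha=0.01 is simply carried over from my model above. Note that only X is scaled here, not y, so the coefficients are in units of y per one standard deviation of each feature.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales (placeholder names).
rng = np.random.default_rng(2)
n = 500
feature_names = ["small_scale", "large_scale"]
X = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1000, n)])
y = 1.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(0, 1, n)

# Scale the features to mean 0 / sd 1, then fit the lasso on the scaled values.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
model.fit(X, y)

# Coefficients now refer to a one-standard-deviation change in each feature.
coefs = model.named_steps["lasso"].coef_
ranked = sorted(zip(feature_names, coefs), key=lambda t: abs(t[1]), reverse=True)

for name, coef in ranked:
    print(f"{name}: {coef:.4f}")
```

In this toy example the raw coefficient on large_scale is tiny (0.003), yet after scaling it ranks first, which seems to be the point of standardizing before comparing coefficients.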