
I took an online course where the instructor explained backward elimination using a (50, 5) dataset, where you eliminate the columns manually by looking at their p-values:

 import numpy as np
 import statsmodels.api as sm

 # add a column of ones for the intercept term
 X = np.append(arr = np.ones((2938, 1)).astype(int), values = X, axis = 1)

 # First step: fit with all columns and inspect the p-values
 X_opt = X[:, [0, 1, 2, 3, 4, 5]]
 regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
 regressor_OLS.summary()

 # Second step: drop the column with the highest p-value (here index 2) and refit
 X_opt = X[:, [0, 1, 3, 4, 5]]
 regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
 regressor_OLS.summary()
 # and so on

Now, while practicing on a large dataset such as the (2938, 214) one I have, do I have to eliminate all the columns myself? That is a lot of work, so is there some algorithm or automated way to do it?

This might be a stupid question, but I am a beginner in machine learning, so any help is appreciated. Thanks

Prajwal
  • A better approach is to apply PCA (Principal Component Analysis) to your `m` predictors and only keep the `n` most significant new features that you get. Take a look at [this answer](https://stackoverflow.com/questions/13224362/principal-component-analysis-pca-in-python) – Victor Deleau Jan 30 '20 at 17:05
  • @VictorDeleau It depends; his features might (and probably do) have problems like multicollinearity or low feature variance, or they may bear no discriminative information for his task. PCA only creates a basis that explains the intrinsic data variance. Feature selection is an important first step at the very least and shouldn't be hand-waved. – Szymon Maszke Jan 30 '20 at 17:54
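For reference, a minimal scikit-learn sketch of the PCA route suggested in the first comment (the 0.95 variance threshold is only an illustrative choice, and `X` is assumed to be the raw feature matrix):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize first so high-variance columns do not dominate the components
X_scaled = StandardScaler().fit_transform(X)

# keep as many components as needed to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)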

1 Answer


What you are trying to do is called "Recursive Feature Elimination", RFE for short.

Example from sklearn.feature_selection.RFE:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# toy regression dataset with 10 features
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# recursively drop one feature per step until 5 remain
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)

This would eliminate features one by one, using the SVR coefficients, until only the 5 most important are left. You could use any estimator that exposes a `coef_` or `feature_importances_` attribute.
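For example, to check which columns were kept after fitting (these attributes belong to the fitted `RFE` object):

print(selector.support_)              # boolean mask of the selected features
print(selector.ranking_)              # selected features get rank 1, eliminated ones get higher ranks
X_selected = selector.transform(X)    # keep only the selected columns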

When it comes to p-values, you could eliminate all features whose p-value is greater than some threshold (the null hypothesis being that the coefficient has no effect, i.e. is zero), but see below.
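If you want to automate exactly the p-value procedure from your course, a minimal sketch with statsmodels could look like this (the 0.05 threshold is arbitrary, and `X` is assumed to already contain the intercept column of ones, as in your snippet):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    cols = list(range(X.shape[1]))
    while len(cols) > 0:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()
        worst_p = model.pvalues.max()
        if worst_p <= significance_level:
            break
        # drop the column with the highest p-value and refit
        cols.pop(int(np.argmax(model.pvalues)))
    return cols, model

kept_columns, final_model = backward_elimination(X, y)
print(kept_columns)
print(final_model.summary())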

Just remember that coefficient weights usually change as some features are removed (as in your course example or in RFE), so this is only an approximation that depends on many factors. You could also do other preprocessing, such as removing correlated features, or use OLS with an L1 penalty (Lasso), which keeps only the most informative features.
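As an illustration of the L1 route with scikit-learn (the alpha value is arbitrary and would normally be tuned, e.g. with LassoCV):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# L1 regularization drives uninformative coefficients to zero
lasso = Lasso(alpha=0.1)
selector = SelectFromModel(lasso)
X_selected = selector.fit_transform(X, y)    # drops columns whose coefficient is (near) zero
print(X_selected.shape)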

Szymon Maszke