I am trying to combine recursive feature elimination and grid search in scikit-learn. As you can see from the code below (which works), I am able to get the best estimator from a grid search and then pass that estimator to RFECV. However, I would rather do the RFECV first, then the grid search. The problem is that when I pass the selector from RFECV to the grid search, it does not accept it:

ValueError: Invalid parameter bootstrap for estimator RFECV

Is it possible to get the selector from RFECV and pass it directly to RandomizedSearchCV, or is this procedurally not the right thing to do?

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint

# Build a classification task using 5 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0)

grid = {"max_depth": [3, None],
        "min_samples_split": sp_randint(1, 11),
        "min_samples_leaf": sp_randint(1, 11),
        "bootstrap": [True, False],
        "criterion": ["gini", "entropy"]}

# RFECV can use the forest's feature_importances_ directly, so no coef_ wrapper is needed
estimator = RandomForestClassifier()
clf = RandomizedSearchCV(estimator, param_distributions=grid, cv=7)
clf.fit(X, y)
estimator = clf.best_estimator_

selector = RFECV(estimator, step=1, cv=4)
selector.fit(X, y)
print(selector.cv_results_)  # grid_scores_ was removed in recent scikit-learn; cv_results_ replaces it
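
For reference, the failing variant simply passes the selector to the search with the forest's raw parameter names. The search then tries to set, e.g., bootstrap on the RFECV object itself, which has no such parameter (a minimal sketch reconstructing the error, not part of the original code):

selector = RFECV(RandomForestClassifier(), step=1, cv=4)
clf = RandomizedSearchCV(selector, param_distributions=grid, cv=7)
clf.fit(X, y)  # ValueError: Invalid parameter bootstrap for estimator RFECV
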
1 Answer

The best way to do this would be to nest the RFECV inside the random search, using the method from this SO answer. Some example code, based on the question code and the SO answer mentioned above:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint

# Build a classification task using 5 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0)

grid = {"estimator__max_depth": [3, None],
        "estimator__min_samples_split": sp_randint(1, 11),
        "estimator__min_samples_leaf": sp_randint(1, 11),
        "estimator__bootstrap": [True, False],
        "estimator__criterion": ["gini", "entropy"]}

estimator = RandomForestClassifier()
selector = RFECV(estimator, step=1, cv=4)
clf = RandomizedSearchCV(selector, param_distributions=grid, cv=7)
clf.fit(X, y)
print(clf.cv_results_)  # grid_scores_ was replaced by cv_results_
print(clf.best_estimator_.n_features_)
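
The estimator__ prefix is what makes this work: RandomizedSearchCV tunes the parameters of the object it is given, here the RFECV, and a meta-estimator like RFECV exposes its wrapped forest's parameters under the estimator__ namespace. To list the names the search will accept (using the selector defined above):

print(selector.get_params().keys())
# includes 'estimator__bootstrap', 'estimator__max_depth', 'estimator__criterion', ...
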
  • is it possible to add multiple estimators into RFECV (similar to a pipeline) to see which works best? in other words, instead of having one fixed RandomForest, say adding other estimators? if so, can you update your answer (see the sketch after these comments). – Areza Feb 16 '19 at 12:27
  • To anyone arriving here, please see the comments in this answer about why it may not be good to join feature selection and hyperparameter tuning: https://stackoverflow.com/a/59301396/1379826 – Sos Aug 03 '21 at 13:04
  • does this work for regression too? I get: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. I used GradientBoostedRandomForestRegressor – Tims Jul 12 '23 at 21:36
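
Regarding the first comment above: the wrapped estimator is itself a constructor parameter of RFECV, so it can be included in the search space directly. A sketch under the assumption that every candidate exposes feature_importances_ (or coef_) and accepts the same tuned hyperparameters; ExtraTreesClassifier is used here purely as an illustrative second candidate:

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

selector = RFECV(RandomForestClassifier(), step=1, cv=4)
search_space = {
    "estimator": [RandomForestClassifier(), ExtraTreesClassifier()],
    "estimator__max_depth": [3, None],  # must be valid for every candidate estimator
}
clf = RandomizedSearchCV(selector, param_distributions=search_space, n_iter=4, cv=3)
clf.fit(X, y)
print(type(clf.best_estimator_.estimator_).__name__)  # winning estimator class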