
I would like to perform recursive feature elimination with nested grid search and cross-validation for each feature subset using scikit-learn. From the RFECV documentation it sounds like this type of operation is supported using the estimator_params parameter:

estimator_params : dict

    Parameters for the external estimator. Useful for doing grid searches.

However, when I try to pass a grid of hyperparameters to the RFECV object:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=5, estimator_params={'C': [0.1, 10, 100, 1000]})
selector = selector.fit(X, y)

I get an error like this:

  File "U:/My Documents/Code/ModelFeatures/bin/model_rcc_gene_features.py", line 130, in <module>
    selector = selector.fit(X, y)
  File "C:\Python27\lib\site-packages\sklearn\feature_selection\rfe.py", line 336, in fit
    ranking_ = rfe.fit(X_train, y_train).ranking_
  File "C:\Python27\lib\site-packages\sklearn\feature_selection\rfe.py", line 146, in fit
    estimator.fit(X[:, features], y)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 178, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 233, in _dense_fit
    max_iter=self.max_iter, random_seed=random_seed)
  File "libsvm.pyx", line 59, in sklearn.svm.libsvm.fit (sklearn\svm\libsvm.c:1628)
TypeError: a float is required

If anyone could show me what I'm doing wrong it would be greatly appreciated, thanks!

EDIT:

After Andreas' response things became clearer. Below is a working example of RFECV combined with grid search.

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
param_grid = [{'C': 0.01}, {'C': 0.1}, {'C': 1.0}, {'C': 10.0}, {'C': 100.0}, {'C': 1000.0}, {'C': 10000.0}]
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=4)
clf = GridSearchCV(selector, {'estimator_params': param_grid}, cv=7)
clf.fit(X, y)
clf.best_estimator_.estimator_
clf.best_estimator_.grid_scores_
clf.best_estimator_.ranking_
DavidS
  • [Read this answer to avoid warning](http://stackoverflow.com/questions/31784392/how-can-i-avoid-using-estimator-params-when-using-rfecv-nested-within-gridsearch/35560648#35560648) – Paulo Alves Feb 22 '16 at 18:05

2 Answers


Unfortunately, RFECV is limited to cross-validating the number of features. You cannot search over the parameters of the SVM with it. The error occurs because SVR expects a float for C, and you gave it a list.

You can do one of two things. Run GridSearchCV on RFECV, which splits the data into folds twice (once inside GridSearchCV and once inside RFECV), but the search over the number of features will be efficient. Or run GridSearchCV directly on RFE, which splits the data only once but scans the parameters of the RFE estimator very inefficiently.
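
The second option (GridSearchCV directly on RFE) might look roughly like the sketch below. It assumes a scikit-learn version that provides `sklearn.model_selection` and the `estimator__C` nested-parameter syntax. The candidate values for `n_features_to_select` and `C` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# RFE has no built-in cross-validation, so the number of features to keep
# becomes one more hyperparameter for GridSearchCV to search over.
selector = RFE(SVR(kernel="linear"), step=1)

# Illustrative grid (assumed values): every combination triggers a full
# RFE run per fold, which is why this approach is inefficient.
param_grid = {
    "n_features_to_select": [3, 5, 7],
    "estimator__C": [0.1, 1.0, 10.0],
}

# A single level of splitting: only GridSearchCV divides the data into folds.
clf = GridSearchCV(selector, param_grid, cv=3)
clf.fit(X, y)

best_params = clf.best_params_           # chosen C and number of features
support = clf.best_estimator_.support_   # boolean mask of selected features
```

Here the data is split only once, at the cost of rerunning the elimination from scratch for each candidate feature count.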

If you would like to make the docstring less ambiguous, a pull request would be welcome :)

Andreas Mueller
  • Ok thanks for the help, much clearer now. I added an example of working grid search with RFECV to my original post for any others who might be struggling. Also, submitted a pull request with some revised documentation, hope it helps. – DavidS May 27 '14 at 15:53
  • I also thank you for the explanation, as I ran into the same problem. Has a pull request for documentation improvement been made yet? I would be happy to contribute if it hasn't been done yet. – J.B. Brown May 31 '14 at 05:17
  • @AndreasMueller Please let me know if you know an answer for this: https://stackoverflow.com/questions/55609339/how-to-perform-feature-selection-with-gridsearchcv-in-sklearn-in-python Thank you very much :) – EmJ Apr 11 '19 at 00:20

The code provided by DavidS did not work for me (sklearn 0.18); it required a small change to how the param_grid is specified and used.

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
param_grid = [{'estimator__C': [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}]
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=4)
clf = GridSearchCV(selector, param_grid, cv=7)
clf.fit(X, y)
clf.best_estimator_.estimator_
clf.best_estimator_.grid_scores_
clf.best_estimator_.ranking_
musterschüler