I have been trying to solve this problem in a similar constellation for some time until I got fed up by the complicated internal conversions of scikit-learn and decided to write my own rfecv working with pipelined transformers.
Basically I implemented the algorithm from Guyon, Isabelle, et al. "Gene selection for cancer classification using support vector machines." Machine learning 46.1 (2002): 389-422, that is also the base for the scikit-learn implementation.
def rfecv(X, y, estimator,
min_features_to_select=3,
splits=3,
step=3,
scoring_metric="f1",
scoring_decimals=3,
random_state=None):
"""
This method is an implementation the recursive feature eliminationalgorithm,
which eliminates unneccessary features. As scikit-learn only provides an RFECV
version [1] that makes using Pipelines very difficult, we have implemented our
own version based on the original paper [2].
[1] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html
[2] Guyon, Isabelle, et al. "Gene selection for cancer classification using support vector machines."
Machine learning 46.1 (2002): 389-422.
:X: a DataFrame containing the features.
:y: a Series containing the labels.
:estimator: a scikit-learn estimator or a Pipeline. If a pipeline is passed,
the last element of the pipeline is assumed to be a classifier providing
a feature_importances_ attribute.
:min_features_to_select: the minimum number of features to evaluate.
:split: number of splits for to use for cross validation.
:step: the amount of features to be reduced during each step.
:scoring_metric: the scoring metric to use for evaluation (e.g., "f_one" or
a callable implementing the sklearn scoring interface).
:scoring_decimals: the scoring metric can be rounded to N decimals to avoid
the reduction from getting stuck with a larger number of features with
very small score gains. Defaults to 3 digits. If None is passed, full
scoring precision is used.
:random_state: if not None, this is the seed for all RNGs used in this function.
:returns: best_features, best_score, ranks, scores; best_features is a list
of features, best_score is the mean score achieved with these features over the
folds, ranks is the order of eliminated features (from most relevant to most irrelevant),
scores is the list of mean scores for each step achieved during the feature
elimination across all folds.
"""
# Initialize survivors and ranked list
survivors = list(X.columns)
ranks = []
scores = []
# While the survivor list is longer than min_features_to_select
while len(survivors) >= min_features_to_select:
# Get only the surviving features
X_tmp = X[survivors]
# Train and get the scores, cross_validate clones
# the model internally, so this does not modify
# the estimator passed to this function
print("[%.2f] evaluating %i features ..." % (time(), len(X_tmp.columns)))
cv_result = cross_validate(estimator, X_tmp, y,
cv=KFold(n_splits=splits,
shuffle=True,
random_state=random_state),
scoring=scoring_metric,
return_estimator=True)
# Append the mean performance to
score = np.mean(cv_result["test_score"])
if scoring_decimals is None:
scores.append(score)
else:
scores.append(round(score, scoring_decimals))
print("[%.2f] ... score %f." % (time(), scores[-1]))
# Get feature weights from the model fitted
# on the best fold and square the weights as described
# in the paper. If the estimator is a Pipeline,
# we get the weights from the last element.
best_estimator = cv_result["estimator"][np.argmax(cv_result["test_score"])]
if isinstance(best_estimator, Pipeline):
weights = best_estimator[-1].feature_importances_
else:
weights = best_estimator.feature_importances_
weights = list(np.power(weights, 2))
# Remove step features (but respect min_features_to_select)
for _ in range(max(min(step, len(survivors) - min_features_to_select), 1)):
# Find the feature with the smallest ranking criterion
# and update the ranks and survivors
idx = np.argmin(weights)
ranks.insert(0, survivors.pop(idx))
weights.pop(idx)
# Calculate the best set of surviving features
ranks_reverse = list(reversed(ranks))
last_max_idx = len(scores) - np.argmax(list(reversed(scores))) - 1
removed_features = set(ranks_reverse[0:last_max_idx * step])
best_features = [f for f in X.columns if f not in removed_features]
# Return ranks and scores
return best_features, max(scores), ranks, scores
Everything you need to know is documented in the docstring. The only exception is how to interprete the returned ranks and scores lists. In the case of step being 1, the score at scores[i] was achieved by removing all the features in list(reversed(ranks))[0:i] (as ranks is the list of removed features ordered from most relevant to most irrelevant).
A minimum working example with a DecisionTree is shown below (but it of course also works with Pipelines and transformers, if the classifier in the Pipeline is the last element):
Python 3.9.1 (default, Dec 11 2020, 14:32:07)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.tree import DecisionTreeClassifier
>>> test_data = load_breast_cancer(as_frame=True)
>>> clf = DecisionTreeClassifier(random_state=0)
>>> clf.fit(test_data.data, test_data.target)
DecisionTreeClassifier(random_state=0)
>>> best_features, best_score, _, _ = rfecv(test_data.data, test_data.target, clf, step=1, min_features_to_select=1, random_state=0)
[1626774242.35] evaluating 30 features ...
[1626774242.38] ... score 0.944000.
[1626774242.38] evaluating 29 features ...
[1626774242.42] ... score 0.938000.
[1626774242.42] evaluating 28 features ...
[1626774242.47] ... score 0.948000.
[1626774242.47] evaluating 27 features ...
[1626774242.50] ... score 0.934000.
[1626774242.51] evaluating 26 features ...
[1626774242.54] ... score 0.938000.
[1626774242.54] evaluating 25 features ...
[1626774242.58] ... score 0.939000.
[1626774242.58] evaluating 24 features ...
[1626774242.62] ... score 0.941000.
[1626774242.62] evaluating 23 features ...
[1626774242.65] ... score 0.944000.
[1626774242.65] evaluating 22 features ...
[1626774242.68] ... score 0.953000.
[1626774242.68] evaluating 21 features ...
[1626774242.70] ... score 0.940000.
[1626774242.70] evaluating 20 features ...
[1626774242.72] ... score 0.941000.
[1626774242.72] evaluating 19 features ...
[1626774242.75] ... score 0.943000.
[1626774242.75] evaluating 18 features ...
[1626774242.77] ... score 0.942000.
[1626774242.77] evaluating 17 features ...
[1626774242.79] ... score 0.944000.
[1626774242.79] evaluating 16 features ...
[1626774242.80] ... score 0.945000.
[1626774242.80] evaluating 15 features ...
[1626774242.82] ... score 0.935000.
[1626774242.82] evaluating 14 features ...
[1626774242.84] ... score 0.935000.
[1626774242.84] evaluating 13 features ...
[1626774242.86] ... score 0.947000.
[1626774242.86] evaluating 12 features ...
[1626774242.87] ... score 0.950000.
[1626774242.87] evaluating 11 features ...
[1626774242.89] ... score 0.950000.
[1626774242.89] evaluating 10 features ...
[1626774242.91] ... score 0.944000.
[1626774242.91] evaluating 9 features ...
[1626774242.92] ... score 0.948000.
[1626774242.92] evaluating 8 features ...
[1626774242.94] ... score 0.953000.
[1626774242.94] evaluating 7 features ...
[1626774242.95] ... score 0.953000.
[1626774242.95] evaluating 6 features ...
[1626774242.97] ... score 0.949000.
[1626774242.97] evaluating 5 features ...
[1626774242.98] ... score 0.951000.
[1626774242.98] evaluating 4 features ...
[1626774243.00] ... score 0.947000.
[1626774243.00] evaluating 3 features ...
[1626774243.01] ... score 0.950000.
[1626774243.01] evaluating 2 features ...
[1626774243.02] ... score 0.942000.
[1626774243.02] evaluating 1 features ...
[1626774243.03] ... score 0.911000.
>>> print(best_features, best_score)
['area error', 'smoothness error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst concavity', 'worst concave points'] 0.953
Regards,
Matthias