
I have a classification task and want to use repeated nested cross-validation to perform hyperparameter tuning and feature selection simultaneously. For this, I am running RandomizedSearchCV on RFECV using Python's sklearn library, as suggested in this SO answer.

However, I additionally need to scale my features and impute some missing values first. Those two steps should also be included in the CV framework to avoid information leakage between training and test folds. I tried to create a Pipeline to get there, but I think it "destroys" my CV nesting (i.e., it performs the RFECV and the random search separately from each other):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFECV
import scipy.stats as stats
from scipy.stats import loguniform  # sklearn.utils.fixes.loguniform is a deprecated alias of this
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# create example data with missings
Xtrain, ytrain = make_classification(n_samples = 500,
                                     n_features = 150,
                                     n_informative = 25,
                                     n_redundant = 125,
                                     random_state = 1897)
c = 10000 # number of missings
Xtrain.ravel()[np.random.choice(Xtrain.size, c, replace = False)] = np.nan # introduce random missings

folds = 5
repeats = 5
rskfold = RepeatedStratifiedKFold(n_splits = folds, n_repeats = repeats, random_state = 1897)
n_iter = 100

scl = StandardScaler()
imp = KNNImputer(n_neighbors = 5, weights = 'uniform')
sgdc = SGDClassifier(loss = 'log', penalty = 'elasticnet', class_weight = 'balanced', random_state = 1897)
sel = RFECV(sgdc, cv = folds)
pipe = Pipeline([('scaler', scl),
                 ('imputer', imp),
                 ('selector', sel),
                 ('clf', sgdc)])
param_rand = {'clf__l1_ratio': stats.uniform(0, 1),
              'clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(pipe, param_rand,
                                    n_iter = n_iter,
                                    cv = rskfold,
                                    scoring = 'accuracy',
                                    random_state = 1897,
                                    verbose = 1,
                                    n_jobs = -1)
rskfold_search.fit(Xtrain, ytrain)

Does anyone know how to include scaling and imputation in the CV framework without losing the nesting of my RandomizedSearchCV and RFECV?

Any help is highly appreciated!


1 Answer


You haven't lost the nested CV.

You have a search object at the top level; when you call fit, it splits the data into multiple folds. Let's focus on one such training fold. Your pipeline gets fitted on it: the scaler and imputer are fitted and applied, then the RFECV splits the transformed training fold into inner folds to select features. Finally, a new estimator gets fitted on the outer training fold and scored on the outer test fold.
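
Schematically, that nesting amounts to the following hand-rolled loop (a minimal sketch reusing `pipe`, `rskfold`, `param_rand`, and the data from the question; illustration only, since the real `RandomizedSearchCV` additionally handles scoring options, refitting, and parallelism):

from sklearn.base import clone
from sklearn.model_selection import ParameterSampler

for params in ParameterSampler(param_rand, n_iter=3, random_state=0):
    scores = []
    for train_idx, test_idx in rskfold.split(Xtrain, ytrain):   # outer CV
        model = clone(pipe).set_params(**params)
        # fitting runs scaler -> imputer -> RFECV (the inner CV) -> clf
        model.fit(Xtrain[train_idx], ytrain[train_idx])
        scores.append(model.score(Xtrain[test_idx], ytrain[test_idx]))
    print(params, np.mean(scores))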

That means the RFE sees perhaps a little leakage, since scaling and imputation are fitted before its inner splits are made. You can put those two steps in a pipeline in front of the estimator and use that pipeline as the RFE's estimator. And since RFECV refits its estimator with the discovered optimal number of features and exposes it for predict and so on, you don't really need the second copy of sgdc; using just the one copy has the side effect of hyperparameter-tuning the selection as well:

scl = StandardScaler()
imp = KNNImputer(n_neighbors=5, weights='uniform')
sgdc = SGDClassifier(loss='log', penalty='elasticnet', class_weight='balanced', random_state=1897)  # loss='log_loss' from sklearn 1.1 on
base_pipe = Pipeline([
    ('scaler', scl),
    ('imputer', imp),
    ('clf', sgdc),
])
# a Pipeline exposes no coef_ of its own, so point RFECV at the
# coefficients of its last step (importance_getter needs sklearn >= 0.24)
sel = RFECV(base_pipe, cv=folds, importance_getter='named_steps.clf.coef_')

param_rand = {'estimator__clf__l1_ratio': stats.uniform(0, 1),
              'estimator__clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(sel, param_rand,
                                    n_iter=n_iter,
                                    cv=rskfold,
                                    scoring='accuracy',
                                    random_state=1897,
                                    verbose=1,
                                    n_jobs=-1)
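
After fitting, you can read the tuned hyperparameters and the selected features straight off the search object (a sketch; as the comments below discuss, on affected sklearn versions the fit itself still trips over a NaN-validation bug):

rskfold_search.fit(Xtrain, ytrain)

print(rskfold_search.best_params_)          # tuned l1_ratio and alpha
best_sel = rskfold_search.best_estimator_   # the refitted RFECV
print(best_sel.n_features_)                 # optimal number of selected features
print(best_sel.support_)                    # boolean mask over the original features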
  • thanks for your help! when I try this, I get the following error: `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` I think this is because the imputation is only applied in the inner CV but not in the outer CV? – RamsesII Nov 12 '21 at 11:41
  • @RamsesII, right, sorry. I was thinking the RFE at transform would still be running the pipeline, but that's not correct. It _does_ contain a fitted copy of the whole pipeline and can `predict`, which would impute and scale first, but its `transform` doesn't. Using `base_pipe` as `clf` should do it (changing the hyperparameter names appropriately), or you can just add the steps into `pipe`. – Ben Reiniger Nov 12 '21 at 13:25
  • thank you again, but I still get the same `ValueError`...? As you suggested, I wrote `pipe = Pipeline([('sel', sel), ('clf', base_pipe)])` and `param_rand = {'clf__clf__l1_ratio': stats.uniform(0, 1), 'clf__clf__alpha': loguniform(0.001, 1)}` instead of the corresponding lines from your answer. Do you happen to have another idea? – RamsesII Nov 16 '21 at 08:21
  • I'm not sure why that wouldn't work, no. Can you provide a data snippet? I also realized you can lean on the `RFECV` refit some more, so the answer will be updated, but it mightn't/shouldn't affect the error. – Ben Reiniger Nov 16 '21 at 15:12
  • thank you a lot again! I've tried out your edited answer but I still get the same error... if you want to check: I've updated my question and added some example data – RamsesII Nov 19 '21 at 09:03
  • Thanks for the MWE! It appears to me to be a bug in sklearn: the RFE validates its data, and tries to avoid this issue by reading the estimator tag `allow_nan` from its underlying estimator; however, in this case that grabs the `Pipeline`'s tag, which does not copy `allow_nan` from its (first) step (the pipeline itself just skips validation). So the pipeline does not `allow_nan`, and therefore the RFE does not either; see the tag-checking sketch after these comments. – Ben Reiniger Nov 19 '21 at 21:23
  • @RamsesII, that should be raised as an issue on their github if it isn't already. If you'd rather not do it, I can. – Ben Reiniger Nov 19 '21 at 21:24
  • Thanks again @Ben Reiniger! If you think this is a bug, then it would be great if you could open an issue on their github. I think you have a better understanding of where exactly the error occurs – RamsesII Nov 22 '21 at 10:00
  • done: https://github.com/scikit-learn/scikit-learn/issues/21743 – Ben Reiniger Nov 22 '21 at 16:40
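
To see the tag mismatch described in the last comments, you can compare the `allow_nan` estimator tags directly (a sketch relying on the private `_get_tags` API as it behaved on sklearn versions around 1.0):

print(imp._get_tags()['allow_nan'])        # True: KNNImputer accepts NaNs
print(base_pipe._get_tags()['allow_nan'])  # False: the Pipeline's tags don't copy it,
                                           # so RFE/RFECV validates and rejects NaN input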