I have a classification task and want to use repeated nested cross-validation to perform hyperparameter tuning and feature selection simultaneously. For this, I am running RandomizedSearchCV on RFECV using Python's sklearn library, as suggested in this SO answer.
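The basic pattern from that answer, as I understand it, is to wrap RFECV directly in RandomizedSearchCV and address the inner classifier's hyperparameters via the estimator__ prefix (my assumption of how RFECV forwards parameters to the wrapped classifier). A minimal sketch, still without the preprocessing I need:

import scipy.stats as stats
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold

clf = SGDClassifier(penalty = 'elasticnet', random_state = 1897)
selector = RFECV(clf, cv = 5)          # inner CV: recursive feature elimination
search = RandomizedSearchCV(selector,  # outer CV: hyperparameter tuning
                            {'estimator__l1_ratio': stats.uniform(0, 1),
                             'estimator__alpha': stats.loguniform(0.001, 1)},
                            n_iter = 10,
                            cv = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 5, random_state = 1897),
                            scoring = 'accuracy',
                            random_state = 1897)
# search.fit(X, y)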
However, I additionally need to scale my features and impute some missing values first. Those two steps should also be included in the CV framework to avoid information leakage between training and test folds. I tried to build everything into a Pipeline, but I think this "destroys" my CV nesting (i.e., it performs the RFECV and the random search separately from each other):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFECV
import scipy.stats as stats
from sklearn.utils.fixes import loguniform
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
# create example data with missings
Xtrain, ytrain = make_classification(n_samples = 500,
                                     n_features = 150,
                                     n_informative = 25,
                                     n_redundant = 125,
                                     random_state = 1897)
c = 10000 # number of missings
Xtrain.ravel()[np.random.choice(Xtrain.size, c, replace = False)] = np.nan # introduce random missings
folds = 5
repeats = 5
rskfold = RepeatedStratifiedKFold(n_splits = folds, n_repeats = repeats, random_state = 1897) # outer CV for the random search
n_iter = 100
scl = StandardScaler()
imp = KNNImputer(n_neighbors = 5, weights = 'uniform')
sgdc = SGDClassifier(loss = 'log', penalty = 'elasticnet', class_weight = 'balanced', random_state = 1897)
sel = RFECV(sgdc, cv = folds) # inner CV for the feature selection
pipe = Pipeline([('scaler', scl),
                 ('imputer', imp),
                 ('selector', sel),
                 ('clf', sgdc)])
param_rand = {'clf__l1_ratio': stats.uniform(0, 1),
              'clf__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(pipe, param_rand, n_iter = n_iter, cv = rskfold, scoring = 'accuracy', random_state = 1897, verbose = 1, n_jobs = -1)
rskfold_search.fit(Xtrain, ytrain)
Does anyone know how to include scaling and imputation in the CV framework without losing the nesting of my RandomizedSearchCV and RFECV?
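One restructuring I have been considering (but am not sure actually preserves the nesting) is to drop the separate clf step, make the RFECV selector the final step of the pipeline, and route the search parameters through it. A sketch, reusing the objects defined above and assuming the selector__estimator__ prefix reaches the classifier inside RFECV:

pipe = Pipeline([('scaler', scl),
                 ('imputer', imp),
                 ('selector', sel)])  # RFECV wraps sgdc and is also used for prediction
param_rand = {'selector__estimator__l1_ratio': stats.uniform(0, 1),
              'selector__estimator__alpha': loguniform(0.001, 1)}
rskfold_search = RandomizedSearchCV(pipe, param_rand, n_iter = n_iter, cv = rskfold,
                                    scoring = 'accuracy', random_state = 1897, verbose = 1, n_jobs = -1)

Would this keep the scaling, imputation, RFECV, and random search properly nested, or does it break something else?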
Any help is highly appreciated!