
I'm using scikit-learn to tune a model's hyper-parameters. I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem would look like this:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))

In my case the preprocessing (what would be StandardScaler() in the toy example) is time-consuming, and I'm not tuning any of its parameters.

So, when I execute the example, the StandardScaler is fitted and applied once for every combination of parameter value and CV split (2 values of C × 2 splits here, plus the transforms used for scoring). But every time StandardScaler runs for a different value of the parameter C it returns the same output, so it would be much more efficient to compute it once per split and then just run the estimator part of the pipeline.

I could manually split the pipeline into the preprocessing (no hyper-parameters tuned) and the estimator. But to fit the preprocessing correctly I would have to use the training set only, so I would have to implement the splits manually and not use GridSearchCV at all.
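
For illustration, that manual alternative would look roughly like this (a sketch only, with toy data; the scaler is fitted once per fold, independently of C):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.rand(10, 3)
y = np.array([0, 1] * 5)  # balanced toy labels

for train_idx, val_idx in StratifiedKFold(n_splits=2).split(X, y):
    scaler = StandardScaler().fit(X[train_idx])                  # fit on the training fold only
    X_tr, X_val = scaler.transform(X[train_idx]), scaler.transform(X[val_idx])
    for C in [0.1, 10.]:                                         # the fitted scaler is reused for every C
        clf = LogisticRegression(C=C).fit(X_tr, y[train_idx])
        print(C, clf.score(X_val, y[val_idx]))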

Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?

Vivek Kumar
Marc Garcia

4 Answers

Update: Ideally, the answer below should not be used, as it leads to data leakage, as discussed in the comments. In this answer, GridSearchCV tunes the hyperparameters on data already preprocessed by a StandardScaler fitted on the whole training set (including the validation folds), which is not correct. In most situations that should not matter much, but algorithms that are very sensitive to scaling will give misleading results.


Essentially, GridSearchCV is itself an estimator: it implements the fit() and predict() methods used by the pipeline.

So instead of:

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

Do this:

clf = make_pipeline(StandardScaler(), 
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))

clf.fit(X, y)
clf.predict(X)

This way, StandardScaler() is called only once, for the single call to clf.fit(), instead of the multiple calls you described. (Note that the parameter grid now uses plain 'C' rather than 'logisticregression__C', because GridSearchCV wraps LogisticRegression directly.)

Edit:

Changed refit to True when GridSearchCV is used inside a pipeline. As mentioned in the documentation:

refit : boolean, default=True Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, the GridSearchCV inside the pipeline does not keep a fitted best estimator, so the pipeline cannot be used for predictions after fit(). When refit=True, the GridSearchCV will be refitted with the best-scoring parameter combination on the whole data passed to fit().

So refit=False is appropriate only if you build the pipeline just to look at the grid-search scores. If you want to call clf.predict(), refit=True must be used, otherwise a NotFittedError will be thrown.
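
A small usage sketch with toy data (illustrative only; 'gridsearchcv' is the step name that make_pipeline derives from the class name):

import numpy as np

X = np.random.rand(10, 3)
y = np.array([0, 1] * 5)  # balanced toy labels

clf.fit(X, y)
print(clf.predict(X))

# inspect the inner grid search through the pipeline step
inner = clf.named_steps['gridsearchcv']
print(inner.best_params_)
print(inner.cv_results_['mean_test_score'])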

Vivek Kumar
  • I didn't think about using GridSearchCV in the pipe itself, sounds like a brilliant idea. Thanks a lot! – Marc Garcia Apr 12 '17 at 12:36
  • @MarcGarcia But do make sure to turn the `refit=True`, else it will throw an error, when calling `clf.predict()` – Vivek Kumar Apr 12 '17 at 13:12
  • @MarcGarcia Edited the answer to reflect the same – Vivek Kumar Apr 12 '17 at 13:22
  • Doesn't this technique use all the data in StandardScaler() instead of just the training set? I don't see how it allows avoiding doing the splits manually. – Victor Deplasse May 12 '17 at 23:51
  • @imad3v No. It will only use the data given in fit() to set the scales, and then use those scales to scale the data given to predict(); it does not fit on that data. Hope you get the point. Please ask if not. – Vivek Kumar May 13 '17 at 01:36
  • @VivekKumar Ok I see that. But then during the fit(), GridSearchCV will tune the hyperparameters by a CV on the data preprocessed by StandardScaler(), so StandardScaler() will also have been fitted on the validation set of GridSearchCV (not the test set passed to predict()), which isn't correct for me, because the validation set shouldn't be used to fit the preprocessing. – Victor Deplasse May 13 '17 at 08:25
  • @VictorDeplasse Yes, I get your point. That is one caveat of using this approach. Thanks. I will update the answer for it. – Vivek Kumar May 14 '17 at 06:39
  • @VivekKumar I tried the above solution for SVC in the following manner: `param_grid = {'SVC__C': [0.01, 0.1, 1], 'SVC__gamma': [0.001, 0.01, 0.1, 1]}`; `pipe = make_pipeline(Normalizer(), GridSearchCV(SVC(), param_grid=param_grid, cv=10, refit=True))`; `pipe.fit(X_train, y_train)` gives the following error: `ValueError: Invalid parameter SVC for estimator SVC`. Can you tell me how I can change the param_grid, as I think that's where the problem is? – Shashwat Siddhant Dec 01 '19 at 17:13
  • @ShashwatSiddhant `param_grid` in your case goes inside the `GridSearchCV`. It has nothing to do with `make_pipeline` here. So in your case, `param_grid` should only contain `'C'` and `'gamma'`. – Vivek Kumar Dec 02 '19 at 09:46
  • What would happen if we pass the Pipeline `memory` parameter instead? – gented Jan 30 '20 at 09:13
  • @gented I'm sorry but I could not understand. Please describe in detail and if possible post a new question. – Vivek Kumar Jan 30 '20 at 12:32
  • @VivekKumar `sklearn.pipeline.Pipeline` has a `memory` parameter that can be specified to cache the fitted transformers. I was wondering if that could be used to cache the fitted pipeline for each combination of hyper-parameters, instead of passing `GridSearchCV` _inside_ the pipeline, to avoid running into the problem of validation folds still being fit on. – gented Jan 30 '20 at 12:52
  • @gented Ah ok. I understand now. Yes, that can be done. But at the time of writing this answer it was not in a stable scikit-learn build, I think. And there were some issues with how the pipeline would optimize them. See the answer below. – Vivek Kumar Jan 30 '20 at 12:56
  • Does this approach work for anyone? I am getting some unexpected results... – lightbox142 Apr 15 '20 at 22:48
  • @teter123f Which approach are you talking about? The one present in the answer or the one discussed in the comments? – Vivek Kumar Apr 16 '20 at 05:34
  • @VivekKumar The one accepted as the answer. Although, I think it might be working. I just thought that it doesn't work because Victor Deplasse's answer below discusses GitHub issues. – lightbox142 Apr 16 '20 at 06:41
  • @VivekKumar For instance, I ran a make_pipeline with a couple of feature transformations and then I had a GridSearchCV for the last ExtraTreesRegressor estimator. Training took quite some time, as expected, but I get a prediction R-squared that is much lower than the R-squared I get using a model I manually built with the same hyperparameters as those set inside the GridSearchCV. Additionally, my Pipeline object says the final estimator is GridSearchCV instead of ExtraTreesRegressor. – lightbox142 Apr 17 '20 at 01:04
  • @VivekKumar If we don't want a data leak we should not use this approach of placing GridSearchCV inside a pipeline. – megjosh Nov 27 '20 at 11:42
  • @megjosh Yes, I agree. This I have already mentioned on top of the answer. – Vivek Kumar Nov 27 '20 at 20:12
  • The alternate strategy would be: 1. perform hyperparameter tuning separately using grid search and cross-validation, and get the best parameters; 2. create a pipeline (pln) with the scaler and the classifier (mine is logistic regression); 3. pass the best parameters to the classifier in pln; 4. pln.fit(train, y); 5. pred = pln.predict(test); 6. proba = pln.predict_proba(test); 7. rocauc = roc_auc_score(pred, proba). Hopefully rocauc is not 1, in which case it will not indicate a data leak. I am passing the pipeline to GridSearchCV and getting a roc_auc score of 1, which is what I am trying to solve now. – megjosh Nov 29 '20 at 07:58
  • This is not generally the proper way to do it. Instead, the pipeline needs to go into GridSearchCV. See this paper for an explanation why your approach can be problematic, e.g. in the case of resampling: https://www.researchgate.net/publication/328315720_Cross-Validation_for_Imbalanced_Datasets_Avoiding_Overoptimistic_and_Overfitting_Approaches – Jonathan Dec 05 '20 at 20:01
  • what about this https://scikit-learn.org/stable/modules/compose.html? – AnandJ Aug 22 '22 at 20:24
  • @AnandJ Are you talking about Caching in the linked page? – Vivek Kumar Aug 23 '22 at 10:09

For those who stumbled upon a slightly different problem, which I had as well.

Suppose you have this pipeline:

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])

Then, when specifying the parameters, you need to prefix them with the 'clf__' name you gave your estimator step (the step name followed by a double underscore). So the parameter grid is going to be:

params={'clf__max_features':[0.3, 0.5, 0.7],
        'clf__min_samples_leaf':[1, 2, 3],
        'clf__max_depth':[None]
        }
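
For completeness, a sketch of plugging this grid into GridSearchCV (X_train/y_train are hypothetical here: a text corpus and its labels):

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(classifier, param_grid=params, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)  # hypothetical training data
print(grid.best_params_)
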
Ayan Omarov

At the time of writing, it is not possible to do this in the current version of scikit-learn (0.18.1). A fix has been proposed on the GitHub project:

https://github.com/scikit-learn/scikit-learn/issues/8830

https://github.com/scikit-learn/scikit-learn/pull/8322
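
For readers on newer releases: scikit-learn 0.19 and later ship a `memory` argument on Pipeline that caches fitted transformers, which addresses this use case without putting GridSearchCV inside the pipeline. A minimal sketch using the question's toy example (the cache directory is just a temporary folder for illustration):

from tempfile import mkdtemp
from shutil import rmtree

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

cachedir = mkdtemp()  # temporary cache for the fitted transformers

pipe = Pipeline([('scaler', StandardScaler()),
                 ('logreg', LogisticRegression())],
                memory=cachedir)

grid = GridSearchCV(pipe,
                    param_grid={'logreg__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

# The scaler is still fitted within each training fold (no leakage), but its
# fit_transform result is cached and reused across the values of C.
grid.fit(X=np.random.rand(10, 3), y=np.array([0, 1] * 5))

rmtree(cachedir)  # clean up the cache when done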

Victor Deplasse

I joined the party late, but I brought a new solution/insight using Pipeline():

  • a sub-pipeline containing your model (regressor/classifier) as its single component
  • a main pipeline made of the routine components:
    • a pre-processing component, e.g. a scaler, dimension reduction, etc.
    • your refitted GridSearchCV(regressor, param_grid) with the desired/best params for your model (note: don't forget refit=True), based on @Vivek Kumar's remark ref
# Build an end-to-end pipeline, supply the data to the regression model, and train/fit it within the main pipeline.
# This avoids leaking the test/val-set into the train-set.

# Create and train the sub-pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

sgd_subpipeline = Pipeline(steps=[#('scaler', MinMaxScaler()), # better not to rescale internally
                                  ('SGD',    SGDRegressor(random_state=0)),
])

# Define the hyperparameter grid
param_grid = {
    'SGD__loss':     ['squared_error', 'epsilon_insensitive', 'squared_epsilon_insensitive', 'huber'],
    'SGD__penalty':  ['l2', 'l1', 'elasticnet'],
    'SGD__alpha':    [0.0001, 0.001, 0.01],
    'SGD__l1_ratio': [0.15, 0.25, 0.5]
}

# Perform grid search
grid_search = GridSearchCV(sgd_subpipeline, param_grid, cv=5, n_jobs=-1, verbose=True, refit=True)
grid_search.fit(X_train, y_train)

# Get the best model
best_sgd_reg = grid_search.best_estimator_

# Print the best hyperparameters
print('=========================================[Best Hyperparameters info]=====================================')
print(grid_search.best_params_)

# summarize best
print('Best Score: %.3f'  % grid_search.best_score_)
print('Best Config: %s' % grid_search.best_params_)
print('==========================================================================================================')

# Create the main pipeline by chaining the refitted GridSearchCV sub-pipeline

sgd_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), # better to rescale externally
                               ('SGD',    grid_search),
])

# Fit the best model on the training data within the pipeline (as you would fit any model/transformer): pipe.fit(traindf[features], traindf[labels])  # i.e. X, y

sgd_pipeline.fit(X_train, y_train)

#--------------------------------------------------------------
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="text")

[Text rendering of sgd_pipeline: MinMaxScaler followed by the GridSearchCV-wrapped SGD sub-pipeline]

Alternatively, you can use TransformedTargetRegressor (specifically if you need to de-scale y, as @mloning commented here) and chain this component, which wraps your regression model (ref). Note:

  • you don't need to set the transformer argument unless you need de-scaling; in that case check the related posts 1, 2, 3, 4 and its score
  • pay attention to this remark about not scaling y here, since:

... With scaling y you actually lose your units....

  • here, it is recommended to:

... Do the transformation outside the pipeline. ...
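
For example, a minimal sketch of transforming the target outside the pipeline (np.log1p/np.expm1 are just an illustrative choice and assume a non-negative target; X_test is hypothetical hold-out data):

import numpy as np

# transform y before fitting and invert after predicting, outside the pipeline
sgd_pipeline.fit(X_train, np.log1p(y_train))
y_pred = np.expm1(sgd_pipeline.predict(X_test))  # back on the original target scale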

# Create the main pipeline using a sub-pipeline made of a TransformedTargetRegressor component,
# re-using the sub-pipeline, param_grid and fitted grid_search defined above
from sklearn.compose import TransformedTargetRegressor

TTR_sgd_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), # better to rescale externally
                                   #('SGD', SGDRegressor()),
                                    ('TTR', TransformedTargetRegressor(regressor= grid_search, #SGDRegressor(),
                                                                       #transformer=MinMaxScaler(),
                                                                       #func=np.log,
                                                                       #inverse_func=np.exp,
                                                                       check_inverse=False))
])



# Fit the best model on the training data within the pipeline (as you would fit any model/transformer): pipe.fit(traindf[features], traindf[labels])  # i.e. X, y
#best_sgd_pipeline.fit(X_train, y_train)
TTR_sgd_pipeline.fit(X_train, y_train)

#--------------------------------------------------------------
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="diagram")

[Diagram rendering of TTR_sgd_pipeline: MinMaxScaler followed by TransformedTargetRegressor wrapping the GridSearchCV sub-pipeline]
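
A usage sketch (X_test/y_test are hypothetical hold-out data):

y_pred = TTR_sgd_pipeline.predict(X_test)      # the inverse transform (if any) is applied automatically
print(TTR_sgd_pipeline.score(X_test, y_test))  # R^2 on the hold-out set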

Mario