I'm dealing with an imbalanced dataset and want to run a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data I want to use SMOTE, and I know I can include it as a stage of a pipeline and pass that to GridSearchCV. My concern is that SMOTE would then be applied to both the training and validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both dataset splits? And if so, how can I work around this? Thanks a lot in advance.
1 Answer
Yes, it can be done, but with imblearn's Pipeline.
imblearn has its own Pipeline that handles samplers correctly. I described this in a similar question here.
When `predict()` is called on an `imblearn.pipeline.Pipeline` object, it skips the sampling step and passes the data unchanged to the next transformer. The sampler is applied only during `fit()`, so inside GridSearchCV only the training folds are oversampled and the validation folds are left as they are.
You can confirm that by looking at the source code here (newer imbalanced-learn releases name this method `fit_resample` instead of `fit_sample`):
    if hasattr(transform, "fit_sample"):
        pass
    else:
        Xt = transform.transform(Xt)
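To make the dispatch above concrete, here is a minimal pure-Python sketch of the idea. Every class in it (`ToySampler`, `ToyScaler`, `ToyThresholdClassifier`, `ToyPipeline`) is a hypothetical stand-in written for illustration, not imbalanced-learn's actual implementation. The toy sampler exposes `fit_resample`, the name newer imbalanced-learn releases use for what the excerpt calls `fit_sample`.

```python
from collections import Counter


class ToySampler:
    """Hypothetical stand-in for a sampler such as SMOTE: it adds rows."""

    def fit_resample(self, X, y):
        counts = Counter(y)
        majority = max(counts.values())
        X_out, y_out = list(X), list(y)
        # Naive oversampling: repeat rows of smaller classes until balanced.
        for label, count in counts.items():
            rows = [x for x, lab in zip(X, y) if lab == label]
            for i in range(majority - count):
                X_out.append(rows[i % len(rows)])
                y_out.append(label)
        return X_out, y_out


class ToyScaler:
    """Hypothetical ordinary transformer: row count never changes."""

    def fit(self, X, y=None):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class ToyThresholdClassifier:
    """Hypothetical final estimator for one feature and labels 0/1."""

    def fit(self, X, y):
        self.n_seen_ = len(X)  # remember how many rows reached training
        positives = [x for x, lab in zip(X, y) if lab == 1]
        negatives = [x for x, lab in zip(X, y) if lab == 0]
        self.threshold_ = (min(positives) + max(negatives)) / 2
        return self

    def predict(self, X):
        return [int(x > self.threshold_) for x in X]


class ToyPipeline:
    """Resamples during fit(), skips samplers during predict()."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        *transforms, final = self.steps
        for step in transforms:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)  # training data only
            else:
                X = step.fit(X, y).transform(X)
        final.fit(X, y)
        return self

    def predict(self, X):
        *transforms, final = self.steps
        for step in transforms:
            if hasattr(step, "fit_resample"):
                continue  # same idea as the `pass` branch in the excerpt
            X = step.transform(X)
        return final.predict(X)


X_train = [1.0, 2.0, 3.0, 4.0, 9.0, 10.0]
y_train = [0, 0, 0, 0, 1, 1]
pipe = ToyPipeline([ToySampler(), ToyScaler(), ToyThresholdClassifier()])
pipe.fit(X_train, y_train)

X_val = [2.5, 9.5, 3.5]
predictions = pipe.predict(X_val)
print(pipe.steps[-1].n_seen_)  # 8: six original rows plus two oversampled
print(predictions)             # [0, 1, 0]: one prediction per validation row
```

The classifier is trained on the oversampled rows, while the validation rows pass through untouched, which is the behaviour the question asks for.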
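So for this to work correctly, you need the following:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # The sampler is a regular named step; imblearn's Pipeline applies it only while fitting.
    model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', LogisticRegression())
    ])

    grid = GridSearchCV(model, params, ...)
    grid.fit(X, y)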
Fill in the details as needed, and the pipeline will take care of the rest.

Vivek Kumar
- Thanks a lot! Does sklearn.pipeline.Pipeline work too for this purpose? – Ehsan M May 11 '18 at 04:48
- @EhsanM No. As I said above, sklearn.pipeline.Pipeline will not handle the `sample()` method of SMOTE, but imblearn.pipeline.Pipeline will. – Vivek Kumar May 11 '18 at 05:02
- @VivekKumar - Using `imblearn.pipeline.Pipeline` with `GridSearchCV` results in an error. `GridSearchCV` does not recognize the estimator's (`LogisticRegression`) parameters and tries to pass the param to the `Pipeline` itself. Any suggestions? – Krishnang K Dalal Nov 08 '19 at 08:23
- @KrishnangKDalal Please post a new question with your code and notify me – Vivek Kumar Nov 08 '19 at 13:00
- Hi @VivekKumar, I have created a new question with my implementation. Here's the link: https://stackoverflow.com/questions/58815016/cross-validating-with-imblearn-pipeline-and-gridsearchcv – Krishnang K Dalal Nov 12 '19 at 08:50
- @agent18, I have updated the links. Thanks for notifying – Vivek Kumar Jan 12 '21 at 07:54