I am using an imbalanced dataset for classification with scikit-learn, and to improve the model's accuracy I created additional synthetic samples with the SMOTE technique. I want to know the right point at which to perform hyperparameter optimization with GridSearch. Should I use the original data only, or the original plus synthetic data?
-
Hi, and welcome to SO. For everyone else to understand and be able to help you, please consider editing your question. You could start by reading this article: [How do I ask a good question](https://stackoverflow.com/help/how-to-ask) and try reformulating the question. It will help others reproduce the problem and maybe find an answer. – pimarc Oct 29 '19 at 19:21
-
Of course, use original+synthetic data, because that is all the training data for this model. – Jim Chen Oct 30 '19 at 01:37
1 Answer
Are you talking about how to use an oversampling method like SMOTE with sklearn's GridSearchCV specifically? I'm making this assumption as you have a scikit-learn tag on the post.
If so, you could use a Pipeline object to pass the SMOTE oversampling step into GridSearchCV. If you fit models with a cross-validation scheme via GridSearchCV, sklearn will then automatically handle fitting/transforming each fold correctly, applying SMOTE only to the training folds. See this answer, which asks how to NOT apply SMOTE to validation folds:
Using Smote with Gridsearchcv in Scikit-learn
The imblearn package has a sklearn-like Pipeline specifically to deal with this, as the link above points out: https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.pipeline.Pipeline.html
Hard to know without seeing your code samples and what you're trying to do, but maybe this could help:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipe = Pipeline([
    ('scaler', StandardScaler(copy=True)),
    ('resample', SMOTE()),
    ('model', RandomForestClassifier()),
])

kf = StratifiedKFold(n_splits=5, shuffle=True)
p_grid = dict(model__n_estimators=[50, 100, 200])

grid_search = GridSearchCV(
    estimator=pipe, param_grid=p_grid, cv=kf, refit=True
)
grid_search.fit(X_train, y_train)

# With refit=True the best pipeline (scaler included) is refit on the full
# training set, so you can predict on raw validation data directly:
# grid_search.predict(X_val)
# If you need the fitted scaler on its own:
# best = grid_search.best_estimator_
# X_val_scaled = best['scaler'].transform(X_val)
