I am using an imbalanced dataset for classification with scikit-learn, and to improve the model's accuracy I created additional synthetic samples with the SMOTE technique. I want to know the right point at which to perform hyperparameter optimization with GridSearch. Should I use the original data only, or the original plus synthetic data?
-
Hi, and welcome to SO. For everyone else to understand and be able to help you, please consider editing your question. You could start by reading this article: [How do I ask a good question](https://stackoverflow.com/help/how-to-ask) and try reformulating the question. It will help others reproduce the problem and maybe find an answer. – pimarc Oct 29 '19 at 19:21
-
Of course, use original+synthetic data, because that is all the training data for this model. – Jim Chen Oct 30 '19 at 01:37
1 Answer
Are you talking about how to use an oversampling method like SMOTE with sklearn's GridSearchCV specifically? I'm making this assumption as you have a scikit-learn tag on the post.
If so, you could use a Pipeline object to pass the SMOTE oversampling step into GridSearchCV. If you fit models with a cross-validation scheme via GridSearchCV, sklearn will then automatically handle fitting/transforming each fold correctly, applying SMOTE only to the training folds. See this answer, which asks how to NOT apply SMOTE to validation folds:
Using Smote with Gridsearchcv in Scikit-learn
The imblearn package has a sklearn-like Pipeline specifically to deal with this, as the link above points out: https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.pipeline.Pipeline.html
Hard to know without seeing your code samples and what you're trying to do, but maybe this could help:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipe = Pipeline([
    ('scaler', StandardScaler(copy=True)),
    ('resample', SMOTE()),
    ('model', RandomForestClassifier()),
])

kf = StratifiedKFold(n_splits=5, shuffle=True)
p_grid = dict(model__n_estimators=[50, 100, 200])

grid_search = GridSearchCV(
    estimator=pipe, param_grid=p_grid, cv=kf, refit=True
)
grid_search.fit(X_train, y_train)

# With refit=True the best pipeline (scaler included) is refit on the full
# training set, so you can predict on raw validation data directly:
# grid_search.predict(X_val)
# If you need the fitted scaler on its own:
# best = grid_search.best_estimator_
# X_val_scaled = best['scaler'].transform(X_val)
