
I have split my data into train/test before doing cross-validation on the training data to validate my hyperparameters. I have an unbalanced dataset and want to perform SMOTE oversampling on each iteration, so I have established a pipeline using imblearn.

My understanding is that oversampling should be done after dividing the data into k folds, to prevent information leakage. Is this order of operations (data split into k folds, k-1 folds oversampled, prediction on the remaining fold) preserved when using Pipeline in the setup below?

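import numpy as np
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# SMOTE is a pipeline step, so it is re-run on each set of training folds.
model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', xgb.XGBClassifier())
])

param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              # Older imblearn releases called this parameter `ratio`.
              'sampling__sampling_strategy': np.linspace(0.25, 0.5, 10)
             }

# scorer_cv_cost_savings is a custom scorer defined elsewhere (not shown).
random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)
random_search.fit(X_train.values, y_train)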
Venkatachalam
TomNash

1 Answer

Your understanding is right. When you pass the pipeline as the model, each cross-validation iteration calls .fit() on the k-1 training folds and tests on the kth fold. Because SMOTE runs inside .fit(), the sampling is applied only to the training folds; the held-out fold is left untouched.

The documentation for imblearn.pipeline.Pipeline.fit() says:

Fit the model

Fit all the transforms/samplers one after the other and transform/sample the data, then fit the transformed/sampled data using the final estimator.

Venkatachalam
  • Hi @Venkatachalam. I have just found your answer to TomNash's question. I am currently having a problem with oversampling and pipelines (I am using different preprocessors). In case you might want to have a look: https://stackoverflow.com/questions/67493509/pre-processing-text-categorical-and-numerical-variables-and-pipelines . I guess there is a more efficient way to run oversampling (I do not know whether my approach is wrong somehow). If you could have a look, I would really appreciate it. Thanks – Math May 12 '21 at 10:32