I'm trying to use the cross_validate function and SMOTE together in a classification problem, and I want to know how to do it correctly.

This is the simple function I use to call cross-validation for a machine-learning classification algorithm:

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate

def bayes(dataIn, dataOut, cv, statistic):
    # training method
    naive_bayes = GaussianNB()

    # applying the method
    outputBayes = cross_validate(estimator = naive_bayes,
                                 X = dataIn, y = dataOut,
                                 cv = cv, scoring = statistic)

    return outputBayes

I looked through the cross_validate documentation to see whether I could specify the training and testing datasets before calling cross_validate, instead of passing the complete dataIn and dataOut. I need this because, to use SMOTE, I have to separate the dataset before doing the cross-validation: if I apply SMOTE across the whole dataset, synthetic samples leak into the test folds and the results will be skewed.

How can I solve this? Should I write my own cross-validation function? I would rather not, because the return value of cross_validate is very convenient, and I do not see how to reproduce exactly the same return.

I saw other questions about this, but none of them answers my specific question:

SMOTE oversampling and cross-validation

Function for cross validation and oversampling (SMOTE)

Does oversampling happen before or after cross-validation using imblearn pipelines?

1 Answer

The third link actually describes what you want. Given the results discussed there, oversampling should be done separately on each fold of the cross-validation procedure. That is exactly what the imblearn package's Pipeline does: the oversampling step is fitted only on the training portion of each fold. You just specify your oversampling technique (SMOTE) and the model (GaussianNB()). A quick adaptation of the code from the third link shows roughly what you want.

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB

model = Pipeline([
        ('sampling', SMOTE()),            # this is the oversampling process
        ('classification', GaussianNB())  # this is where to specify the model
    ])


from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

param_dist = {...  # [REVIEW DOCUMENTATION FOR CORRECT SET OF PARAMS]
             }

random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)  # custom scorer from the linked code
random_search.fit(X_train.values, y_train)