
I have a highly imbalanced dataset and would like to perform SMOTE to balance it and cross-validation to measure the accuracy. However, most of the existing tutorials use only a single training/testing split when applying SMOTE.

Therefore, I would like to know the correct procedure to perform SMOTE with cross-validation.

My current code is as follows. However, as mentioned above, it only uses a single train/test split.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())  # fit_sample in older imbalanced-learn versions
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(X_train_res, y_train_res)

I am happy to provide more details if needed.


3 Answers


You need to perform SMOTE within each fold, so you should avoid train_test_split in favour of KFold:

from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and  y_train
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)  # fit_sample in older imbalanced-learn versions
    model = ...  # Choose a model here
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')

You can also, for example, append the per-fold scores to a list defined outside the loop and average them at the end.
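For instance, here is a minimal sketch (reusing the kf, SMOTE, f1_score, X, y and model from the snippet above, and assuming you want the mean f-score across folds):

from numpy import mean

fold_scores = []  # defined outside the loop, one entry per fold
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    X_res, y_res = SMOTE().fit_resample(X_train, y_train)  # oversample the training fold only
    model.fit(X_res, y_res)
    fold_scores.append(f1_score(y_test, model.predict(X_test)))

print(f'Mean f-score over {kf.get_n_splits()} folds: {mean(fold_scores)}')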

  • As a note: you may wish to use `StratifiedKFold` instead, as in the other answer, since you presumably have an imbalanced class problem. – gmds Apr 09 '19 at 11:20
  • thanks a lot. I also have a y value. In that case, how can I change this `in enumerate(kf.split(X), 1):`? – EmJ Apr 09 '19 at 11:22
  • @Emi you shouldn't need to modify that. What `kf.split` does is just take the *size* of `X` (how many rows it has) to determine how to generate indices for each fold. Since your `y` should be the same size as `X`, you won't need to provide it. That said, you *can* do `kf.split(X, y)` and it will have the same effect. – gmds Apr 09 '19 at 11:24
  • @gmds A small question: why didn't you fit the model on the oversampled data ``` X_train_oversampled ``` and ```y_train_oversampled```, and you rather did ```model.fit(X_train, y_train) ``` ? – Perl Del Rey Nov 14 '19 at 16:03
  • 1
    @Hiyam That was actually my mistake, thanks! Will edit. – gmds Nov 15 '19 at 00:44
  • @gmds I came across your answer and have a fundamental question about it. Based on your answer, for each fold we calculate accuracy and f1_score, and I assume we can use any other metric as well. When the for loop is done, what is the final score? Is it the average of the scores across all folds? And if we want to run the model with different parameters (grid search), should we compare the average score for each parameter combination? – Ross_you Oct 12 '20 at 18:27
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    X_train, y_train = SMOTE().fit_resample(X_train, y_train)  # fit_sample in older imbalanced-learn versions
    ...  # fit and evaluate your model on this fold

I think you can also solve this with a pipeline from the imbalanced-learn library.

I saw this solution on the Machine Learning Mastery blog: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

The idea is to use a pipeline from imblearn to do the cross-validation, so that SMOTE is fitted and applied only to the training folds of each split and never to the validation fold. Please let me know if that works. The example below uses a decision tree, but the logic is the same.

# decision tree evaluated on an imbalanced dataset with SMOTE oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
score = mean(scores)  # average ROC AUC across all folds and repeats