
I'm trying to classify text into 6 different classes. Since I have an imbalanced dataset, I'm also using the SMOTETomek method, which should synthetically balance the dataset with additional artificial samples.

I've noticed a huge score difference when applying it via a pipeline vs. "step by step", where the only difference (I believe) is where I call train_test_split.

Here are my features and labels:

features, labels = [], []
for curr_features, label in self.training_data:
    features.append(curr_features)
    labels.append(label)

algorithms = [
    linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None),
    naive_bayes.MultinomialNB(),
    naive_bayes.BernoulliNB(),
    tree.DecisionTreeClassifier(max_depth=1000),
    tree.ExtraTreeClassifier(),
    ensemble.ExtraTreesClassifier(),
    svm.LinearSVC(),
    neighbors.NearestCentroid(),
    ensemble.RandomForestClassifier(),
    linear_model.RidgeClassifier(),
]

Using Pipeline:

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Provide Report for all algorithms
score_dict = {}
for algorithm in algorithms:
    model = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('smote', SMOTETomek()),
        ('classifier', algorithm)
    ])
    model.fit(X_train, y_train)

    # Score
    score = model.score(X_test, y_test)
    score_dict[model] = int(score * 100)

sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
for classifier, score in sorted_score_dict.items():
    print(f'{classifier.__class__.__name__}: score is {score}%')

Using Step by Step:

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
cv = vectorizer.fit_transform(features)
text_tf = transformer.fit_transform(cv).toarray()

smt = SMOTETomek()
X_smt, y_smt = smt.fit_resample(text_tf, labels)

X_train, X_test, y_train, y_test = train_test_split(X_smt, y_smt, test_size=0.2, random_state=0)
self.test_classifiers(X_train, X_test, y_train, y_test, algorithms)

def test_classifiers(self, X_train, X_test, y_train, y_test, classifiers_list):
    score_dict = {}
    for model in classifiers_list:
        model.fit(X_train, y_train)

        # Score
        score = model.score(X_test, y_test)
        score_dict[model] = int(score * 100)
       
    print()
    print("SCORE:")
    sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
    for model, score in sorted_score_dict.items():
        print(f'{model.__class__.__name__}: score is {score}%')

I'm getting around 65% (for the best classifier model) using the pipeline vs. 90% using step by step. Not sure what I'm missing.


2 Answers


There is nothing wrong with your code by itself, but your step-by-step approach follows a bad practice in machine learning:

Do not resample your testing data

In your step-by-step approach, you resample all of the data first and then split it into train and test sets. This leads to an overestimation of model performance because you have altered the original class distribution in your test set, so it is no longer representative of the original problem.

What you should do instead is leave the testing data in its original distribution in order to get a valid estimate of how your model will perform on the original data, which represents the situation in production. Therefore, your approach with the pipeline is the way to go.
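
If you want to keep the step-by-step version, the key change is the ordering: split first, then fit the transformers and the resampler on the training portion only. A minimal sketch, reusing the variable names from your snippets and default SMOTETomek settings:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from imblearn.combine import SMOTETomek

# 1. Split the raw texts first, so the test set keeps its original class distribution
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# 2. Fit the vectorizer/transformer on the training texts only, then transform both sets
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
X_train_tfidf = transformer.fit_transform(vectorizer.fit_transform(X_train))
X_test_tfidf = transformer.transform(vectorizer.transform(X_test))

# 3. Resample only the training data; the test set stays untouched
smt = SMOTETomek()
X_train_res, y_train_res = smt.fit_resample(X_train_tfidf, y_train)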

As a side note: you could think about moving the whole data preparation (vectorization and resampling) out of your fitting and testing loop, since you presumably want to compare model performance against the same data anyway. Then you would only have to run these steps once, and your code would execute faster.
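
Continuing the sketch above, the classifier loop then only needs the already-prepared data, so the expensive preparation runs once (just one way to structure it):

score_dict = {}
for algorithm in algorithms:
    # Every model is fitted on the same resampled training data ...
    algorithm.fit(X_train_res, y_train_res)
    # ... and evaluated on the same untouched test data
    score_dict[algorithm] = int(algorithm.score(X_test_tfidf, y_test) * 100)

for model, score in sorted(score_dict.items(), key=lambda item: item[1]):
    print(f'{model.__class__.__name__}: score is {score}%')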

afsharov
  • Thanks for the fast answer. Where should I resample then? Are you implying I should remove SMOTETomek from the pipeline? – Ben May 29 '21 at 10:39
  • If you leave the SMOTETomek in the pipeline, it will still perform the resampling on the testing set as it passes through the transformation steps. I would suggest only resampling once after all transformations are done and outside the pipeline. Also updated the answer in this regard. – afsharov May 29 '21 at 11:01
  • I'm sorry for repeating myself. Can you update your answer with a relevant example for your suggested pipeline part? – Ben May 29 '21 at 11:57
  • "*SMOTETomek [...] will still perform the resampling on the testing set*" - this is not correct, as clarified in the docs; see my own answer below. – desertnaut May 29 '21 at 12:42
  • @desertnaut I missed that this must be an `imblearn` pipeline as `scikit-learn` pipelines do not work with samplers ... revised the answer. Thanks for the clarification. – afsharov May 29 '21 at 13:32
  • You are welcome; it is not clarified in the post indeed, but I don't think sklearn pipelines can work with SMOTE. – desertnaut May 29 '21 at 13:47
  • Correct, only imblearn has a SMOTE implementation. – Hamish Gibson May 29 '21 at 21:37

The correct approach in such cases is described in detail in my own answer in the Data Science SE thread Why you shouldn't upsample before cross validation (although that answer is about CV, the rationale is identical for the train/test split case as well). In short, any resampling method (SMOTE included) should be applied only to the training data and never to the validation or test data.

Given that, your Pipeline approach here is correct: you apply SMOTE only to your training data after splitting, and, according to the documentation of the imblearn pipeline:

The samplers are only applied during fit.

So, no SMOTE is actually applied to your test data during model.score, which is exactly as it should be.
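
For clarity, this is what that looks like in the pipeline setup, assuming the Pipeline is imported from imblearn (the scikit-learn Pipeline would reject a sampler step, since samplers expose fit_resample rather than transform); the LinearSVC here is just a placeholder classifier:

from imblearn.pipeline import Pipeline            # not sklearn.pipeline.Pipeline
from imblearn.combine import SMOTETomek
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('smote', SMOTETomek()),                      # sampler: applied during fit only
    ('classifier', LinearSVC())                   # placeholder for any of your classifiers
])

model.fit(X_train, y_train)     # training data is vectorized, resampled, then used for fitting
model.score(X_test, y_test)     # test data is only vectorized/transformed, never resampled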

Your step-by-step approach, on the other hand, is wrong on many levels, and SMOTE is only one of them; all these preprocessing steps should be applied after the train/test split and fitted only on the training portion of your data, which is not the case here, so the results are invalid (no wonder they look "better"). For a general discussion (and a practical demonstration) of how and why such preprocessing should be applied only to the training data, see my two answers in Should Feature Selection be done before Train-Test Split or after? (the discussion there is about feature selection, but it applies equally to feature engineering steps such as count vectorization and TF-IDF transformation).
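
As a toy illustration of the leakage from fitting the vectorizer on all the data (made-up documents, not your data): when CountVectorizer is fitted on the training texts only, tokens that appear only in the test set are simply ignored at transform time, whereas fitting it on the full corpus lets the test documents shape the vocabulary (and, downstream, the TF-IDF statistics):

from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good product", "bad service"]
test_docs = ["excellent product"]              # "excellent" never appears in training

vect = CountVectorizer().fit(train_docs)       # vocabulary built from training data only
print(vect.get_feature_names_out())            # ['bad', 'good', 'product', 'service']
print(vect.transform(test_docs).toarray())     # 'excellent' is ignored -> no leakage

leaky = CountVectorizer().fit(train_docs + test_docs)   # WRONG: test data influences the vocabulary
print(leaky.get_feature_names_out())           # now includes 'excellent' -> information leaked from the test set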

desertnaut