How can I use SMOTE in a Sklearn Pipeline for an NLP Classification problem?

Question

I'm dealing with a multiclass classification problem, in which some classes are very imbalanced. My data looks like this:

product_description                  class
"This should be used to clean..."    1
"Beauty product, natural..."         2
"Cleaning product, be careful..."    2
"Food, prepared with fruits..."      2
"T-shirt, sports, white, light..."   3
"Cleaning product, used to ..."      2
"Blue pants, two pockets, men..."    3

So I needed to make a classification model. This is what my pipeline currently looks like:

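import string

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X = df['product_description']
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)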

def text_process(mess):
    # Build the stopword set once per call for fast membership checks
    STOPWORDS = set(stopwords.words("english"))

    # Drop punctuation characters and rejoin the rest into a single string
    nopunc = "".join(char for char in mess if char not in string.punctuation)

    # Return a list of tokens with stopwords removed; a callable analyzer
    # must yield tokens, and a plain string would be counted per character
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer(analyzer=text_process)),
        ("feature_selection", SelectKBest(chi2, k=20)),
        ("polynomial", PolynomialFeatures(2)),
        ("reg", LogisticRegression()),
    ]
)

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print(classification_report(y_test, y_pred))

However, my dataset is very imbalanced, with the following distribution: class 1 - 80%, class 2 - 10%, class 3 - 5%, class 4 - 4%, class 5 - 1%. So I'm trying to apply SMOTE, but I still can't work out where it should be applied.

At first, I thought about applying SMOTE before the Pipeline, but I got the following error:

ValueError: could not convert string to float: '...'

So I thought about using SMOTE inside the Pipeline, but I also got an error. I tried putting SMOTE() as the first step and also as the second step, after CountVectorizer (which is what seemed logical to me), but both returned the same error:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't

Any idea how to solve this issue? What am I missing here?

Thanks

score 4 · Answer 1 · answered Sep 08 '21 at 14:09

Using a resampler like SMOTE requires the imblearn version of Pipeline.

This is because resamplers have to change both X and y, and ordinary sklearn pipelines do not do this. The imblearn pipeline accommodates this by allowing its intermediate steps to be either transformers (implementing transform) or samplers (implementing fit_resample). Importantly, resampling only happens during fitting, on the training data, and not during later transformations or predictions. Otherwise it should operate the same as an ordinary sklearn pipeline. In your case, SMOTE has to come after CountVectorizer, because it needs numeric features rather than raw strings.
