I'm dealing with a multiclass classification problem, in which some classes are very imbalanced. My data looks like this:
product_description class
"This should be used to clean..." 1
"Beauty product, natural..." 2
"Cleaning product, be careful..." 2
"Food, prepared with fruits..." 2
"T-shirt, sports, white, light..." 3
"Cleaning product, used to ..." 2
"Blue pants, two pockets, men..." 3
I need to build a classification model. This is what my pipeline currently looks like:
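import string

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X = df['product_description']
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)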
def text_process(mess):
    STOPWORDS = set(stopwords.words("english"))
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    # Join the characters again to form the string.
    nopunc = "".join(nopunc)
    # Now just remove any stopwords. Return a list of tokens: a callable
    # analyzer must yield tokens, and a joined string would be iterated
    # character by character by CountVectorizer.
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]
pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer(analyzer=text_process)),
        ("feature_selection", SelectKBest(chi2, k=20)),
        ("polynomial", PolynomialFeatures(2)),
        ("reg", LogisticRegression()),
    ]
)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
However, I have a very imbalanced dataset, with the following distribution: class 1 - 80%, class 2 - 10%, class 3 - 5%, class 4 - 4%, class 5 - 1%. I'm trying to apply SMOTE, but I still don't understand where it should be applied.
At first, I thought about applying SMOTE before the Pipeline, but I got the following error:
ValueError: could not convert string to float: '...'
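This error is consistent with how SMOTE works: it creates synthetic samples by interpolating between numeric feature vectors, so raw strings cannot be used as input. A minimal pure-Python sketch of that interpolation step follows; the helper name `smote_like_sample` is hypothetical and is not part of imblearn's API:

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)

# Two numeric feature vectors from the same (minority) class.
x = [1.0, 0.0, 2.0]
neighbor = [3.0, 0.0, 1.0]
synthetic = smote_like_sample(x, neighbor, rng)

# The same arithmetic is undefined for raw text, which is why the
# descriptions have to be vectorized before any oversampling.
string_error = None
try:
    smote_like_sample(["Beauty product"], ["Cleaning product"], rng)
except TypeError as exc:
    string_error = exc
```

Each coordinate of `synthetic` lies between the corresponding coordinates of `x` and `neighbor`, while the string inputs fail at the subtraction.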
So I thought about using SMOTE inside the Pipeline, but I also got an error. I tried putting SMOTE() in the first step and also in the second step, after CountVectorizer (which seemed logical to me), but both returned the same error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
Any idea on how to solve this issue? What am I missing here?
Thanks