My issue concerns the ValueError raised by the SMOTE class:

Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6

# imbalanced-learn is a package containing an implementation of SMOTE
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.pipeline import Pipeline
# label column (the first column)
y = feature_set.iloc[:,0]
# feature matrix: everything except text and label columns
x = feature_set.loc[:, feature_set.columns != 'text_column']
x = x.loc[:, x.columns != 'label_column']
x_resampled, y_resampled = SMOTE().fit_resample(x, y)

After some investigation I found that some of my classes (158 all in all) have extremely few samples.
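
One way to run that check, as a rough sketch: count the samples per class and list the classes that are too small for SMOTE. The helper name `classes_too_small_for_smote` and the toy labels are made up for illustration; the threshold of `k_neighbors + 1` is inferred from the error message above, where the default `k_neighbors=5` shows up as `n_neighbors = 6`.

```python
from collections import Counter


def classes_too_small_for_smote(labels, k_neighbors=5):
    """Return {label: count} for classes with fewer than k_neighbors + 1 samples.

    Hypothetical helper, not part of imbalanced-learn.
    """
    required = k_neighbors + 1  # matches "n_neighbors = 6" in the error message
    return {label: n for label, n in Counter(labels).items() if n < required}


# Toy labels standing in for the real label column
y_example = ["a"] * 10 + ["b"] * 3 + ["c"]
print(classes_too_small_for_smote(y_example))  # {'b': 3, 'c': 1}
```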

The solution proposed in this post is to:

Create a pipeline that uses SMOTE and RandomOverSampler so that SMOTE is applied only to classes satisfying the condition n_neighbors <= n_samples, and random oversampling is used where the condition is not satisfied.

However, I am still struggling to get my experiment up and running.

# initialize oversamplers
smote = SMOTE()
randomSampler = RandomOverSampler()
# create a pipeline
pipeline = Pipeline([('smote', smote), ('randomSampler', randomSampler)])
pipeline.fit_resample(x, y)

When I run it, I still get the same error. My guess is that the pipeline applies both samplers in sequence, whereas I need only one of them to be applied to a given class, based on a predefined condition (if the number of items is less than X, use RandomOverSampler; otherwise use SMOTE). Is there a way to set a condition so that RandomOverSampler is called only for classes with an extremely low number of items?

Thank you in advance.

Alibek Jakupov
  • Please edit to clarify; it is currently unclear what *exactly* your question is, if it is indeed about the error you mention in the beginning (where exactly does it pop up?), or what exactly you mean by "struggling" – desertnaut May 07 '19 at 16:31
  • I tried to clarify the question. Does this seem more clear? Actually the question is about setting custom conditions in a pipeline. – Alibek Jakupov May 08 '19 at 17:19

1 Answer

I ran into the same problem as you (Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6) and followed the same advice from that post.

I think you are getting the same error because you are putting the random oversampler AFTER the SMOTE operation. That is, you need to oversample your minority classes BEFORE applying the SMOTE algorithm.

This worked for me:

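from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('ros', RandomOverSampler()),  # runs first, so tiny classes grow before SMOTE
    ('oversampler', SMOTE()),
    ('clf', LinearSVC()),
])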
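
If the goal is to randomly oversample only the classes that are too small and leave the rest to SMOTE, a possible sketch is below. With its default settings RandomOverSampler already brings every class up to the size of the largest one, which leaves SMOTE with nothing to generate. Here `sampling_strategy` is instead given a dict of target counts covering only the tiny classes. The helper names `random_oversampling_targets` and `make_two_stage_sampler` are hypothetical, and the threshold of `k_neighbors + 1` is taken from the error message rather than from the library documentation.

```python
from collections import Counter


def random_oversampling_targets(labels, k_neighbors=5):
    """Map each too-small class to the minimum size SMOTE accepts.

    Hypothetical helper; classes that are already large enough are left out,
    so RandomOverSampler does not touch them.
    """
    required = k_neighbors + 1
    return {label: required for label, n in Counter(labels).items() if n < required}


def make_two_stage_sampler(labels, k_neighbors=5):
    """Build a pipeline: random oversampling for tiny classes, then SMOTE."""
    # Imported here so the helper above stays usable without imbalanced-learn.
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.pipeline import Pipeline

    targets = random_oversampling_targets(labels, k_neighbors)
    return Pipeline([
        ("ros", RandomOverSampler(sampling_strategy=targets)),
        ("smote", SMOTE(k_neighbors=k_neighbors)),
    ])
```

With the x and y from the question, this would be used as `x_resampled, y_resampled = make_two_stage_sampler(y).fit_resample(x, y)`. Note that the tiny classes then consist mostly of duplicated rows, so the points SMOTE synthesizes for them will add little variety.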

BringBackCommodore64