My issue concerns the ValueError raised by the SMOTE class:
Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6
# imbalanced-learn is a package containing an implementation of SMOTE
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.pipeline import Pipeline
# label column (the first column)
y = feature_set.iloc[:,0]
# feature matrix: everything except text and label columns
x = feature_set.loc[:, feature_set.columns != 'text_column']
x = x.loc[:, x.columns != 'label_column']
x_resampled, y_resampled = SMOTE().fit_resample(x, y)
After some investigation, I found that some of my classes (158 in total) are extremely under-represented; as the error message shows, at least one of them has only a single sample.
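For reference, this is roughly how such classes can be spotted. It is only a minimal sketch: the helper name `classes_too_small_for_smote` and the toy labels are made up for illustration, and it assumes SMOTE's default of `k_neighbors=5`, which matches the `n_neighbors = 6` in the error message (the sample itself plus its five neighbours).

```python
from collections import Counter


def classes_too_small_for_smote(labels, k_neighbors=5):
    """Return {label: count} for classes with fewer than k_neighbors + 1 samples."""
    # SMOTE queries each sample plus its k nearest neighbours within the class,
    # so a class needs at least k_neighbors + 1 samples.
    required = k_neighbors + 1
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < required}


# Toy labels standing in for the real `y` column (hypothetical data).
toy_labels = ["a"] * 8 + ["b"] * 1 + ["c"] * 4
print(classes_too_small_for_smote(toy_labels))
```

Passing the real `y` instead of `toy_labels` should list every class that triggers the error.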
According to the solution proposed in this post, I should:
Create a pipeline that uses SMOTE and RandomOverSampler so that SMOTE is applied only to classes satisfying the condition n_neighbors <= n_samples, and random oversampling is used for classes where the condition is not satisfied.
However, I am still struggling to get my experiment up and running.
# initialize oversamplers
smote = SMOTE()
randomSampler = RandomOverSampler()
# create a pipeline
pipeline = Pipeline([('smote', smote), ('randomSampler', randomSampler)])
pipeline.fit_resample(x, y)
When I run it, I still get the same error. My guess is that the pipeline applies both samplers in sequence, whereas I need only one of them to be applied to a given class, based on a predefined condition (if the number of samples in a class is less than X, use RandomOverSampler; otherwise use SMOTE). Is there a way to set a condition so that RandomOverSampler is called only for classes with an extremely low number of samples?
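To make the condition concrete, here is a minimal sketch of what I have in mind, not a confirmed solution. It assumes that `RandomOverSampler` accepts a dict for `sampling_strategy` mapping a class label to the desired number of samples, and that the random step must run before SMOTE (in my pipeline above SMOTE runs first, which may be why the error persists). The names `rare_class_targets` and `oversample_in_two_steps` are hypothetical helpers, not part of imbalanced-learn.

```python
from collections import Counter


def rare_class_targets(y, k_neighbors=5):
    """Map each class that is too small for SMOTE to the minimum count it needs."""
    min_samples = k_neighbors + 1  # the sample itself plus its k neighbours
    counts = Counter(y)
    return {label: min_samples for label, n in counts.items() if n < min_samples}


def oversample_in_two_steps(x, y, k_neighbors=5, random_state=0):
    """Randomly duplicate rare classes up to k_neighbors + 1, then apply SMOTE."""
    # Imported lazily so rare_class_targets works without imbalanced-learn installed.
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.pipeline import Pipeline

    steps = []
    targets = rare_class_targets(y, k_neighbors)
    if targets:
        # Only the classes listed in `targets` are touched by the random step.
        steps.append(
            ("random", RandomOverSampler(sampling_strategy=targets,
                                         random_state=random_state))
        )
    # After the random step every class has enough samples for SMOTE.
    steps.append(("smote", SMOTE(k_neighbors=k_neighbors,
                                 random_state=random_state)))
    return Pipeline(steps).fit_resample(x, y)
```

The intended call would be `x_resampled, y_resampled = oversample_in_two_steps(x, y)`. Note that classes padded with exact duplicates will give SMOTE nothing to interpolate between, so their synthetic samples would presumably be copies as well.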
Thank you in advance.