How to fix samples < K-neighbours error in oversampling using SMOTE?

Question

I am designing a multi-class classifier for 11 labels and using SMOTE to tackle the class imbalance. However, I get the following error:

Error at SMOTE

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_sample(X_f, Y_f)

error

~/.local/lib/python3.6/site-packages/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
    414                 "Expected n_neighbors <= n_samples, "
    415                 " but n_samples = %d, n_neighbors = %d" %
--> 416                 (train_size, n_neighbors)
    417             )
    418         n_samples, _ = X.shape

ValueError: Expected n_neighbors <= n_samples,  but n_samples = 1, n_neighbors = 6

Why does it say I have only 1 n_samples?

When I tried the same code on a much smaller dataset of 100k rows (and only 4 labels), it ran just fine.

details about input

input parameters

X_f

array([[1.43347000e+05, 1.00000000e+00, 2.03869492e+03, ...,
        1.00000000e+00, 1.00000000e+00, 1.35233019e+03],
       [5.09050000e+04, 0.00000000e+00, 0.00000000e+00, ...,
        5.09050000e+04, 0.00000000e+00, 5.09050000e+04],
       [1.43899000e+05, 2.00000000e+00, 2.11447368e+03, ...,
        1.00000000e+00, 2.00000000e+00, 1.39707767e+03],
       ...,
       [8.50000000e+01, 0.00000000e+00, 0.00000000e+00, ...,
        8.50000000e+01, 0.00000000e+00, 8.50000000e+01],
       [2.33000000e+02, 4.00000000e+00, 4.90000000e+01, ...,
        4.00000000e+00, 4.00000000e+00, 7.76666667e+01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

Y_f

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

dimensions of input parameters

print(X_f.shape, Y_f.shape)
(2087620, 31) (2087620, 11)

my attempts to use other techniques from the `imblearn` package

Debugging the SMOTE fit_resample() method: I know SMOTE works by synthesizing minority samples, interpolating between a minority data point and its nearest neighbours (by Euclidean distance). So I printed out the n_samples variable in the ../python3.6/site-packages/sklearn/neighbors/base.py file. It showed a steadily decreasing number of samples, 5236 -> 103 -> 3, and then I got the error. I could not understand what was going on.

Using SVMSMOTE: takes too long to compute (over 2 days), and the PC crashes.
Using RandomOverSampler: the model gives poor accuracy (45%).
Using a different sampling_strategy: works for minority only.
I also tried the suggestions provided here and here, unsuccessfully. Honestly, I could not understand them.
I received the same error when I reduced the dataset to 100k, 1k and 5k rows.

Despite trying, I do not understand much of it. I am a newbie at sampling. Can you help me fix this problem, please?

Maybe it's a bug in imblearn? – Has QUIT--Anony-Mousse, Apr 10 '19 at 05:45

score 1 · Accepted Answer · answered Apr 30 '19 at 09:06

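This error occurs because some classes in the dataset have too few instances. For instance, in a 2M-row dataset, there was only one instance with a specific label, "��".

By default, SMOTE uses k_neighbors=5 and therefore queries 6 neighbours within each class (each sample plus its 5 nearest neighbours), which is the n_neighbors = 6 in the error message. A class with a single instance has n_samples = 1, so there are no neighbours for the SMOTE algorithm to build synthetic samples from. Check your dataset carefully, and make sure it is clean and usable.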
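To find which labels are affected, it can help to count the rows per class before calling SMOTE. The following is only a minimal sketch: it assumes one-hot labels shaped like Y_f in the question and the default k_neighbors=5. The helper name classes_too_small_for_smote and the toy data are made up for illustration.

```python
import numpy as np

def classes_too_small_for_smote(Y_onehot, k_neighbors=5):
    """Return {class_index: row_count} for classes with fewer than k_neighbors + 1 rows."""
    Y_onehot = np.asarray(Y_onehot)
    # Summing each one-hot column gives the number of rows per class.
    counts = Y_onehot.sum(axis=0).astype(int)
    needed = k_neighbors + 1  # each sample plus its k nearest neighbours
    # Classes with zero rows are skipped: there is nothing to oversample.
    return {int(i): int(c) for i, c in enumerate(counts) if 0 < c < needed}

# Toy data (hypothetical): class 0 has 7 rows, class 1 has 6, class 2 has 1.
labels = np.array([0] * 7 + [1] * 6 + [2])
Y_demo = np.eye(3)[labels]

small_classes = classes_too_small_for_smote(Y_demo)
print(small_classes)
```

On the real data this would be called as classes_too_small_for_smote(Y_f); any class it reports is a candidate cause of the ValueError.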

The unnecessary instance was removed using df.where("Label != '��'")
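If the data is held in NumPy arrays rather than a DataFrame, a similar filter can be sketched as below. This is an assumption-laden illustration, not the exact code used above: drop_rare_classes is a hypothetical helper, the toy arrays are invented, and min_count=6 simply mirrors the n_neighbors = 6 from the error message.

```python
import numpy as np

def drop_rare_classes(X, y, min_count=6):
    """Keep only the rows whose class occurs at least min_count times."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    keep_classes = classes[counts >= min_count]
    mask = np.isin(y, keep_classes)  # True for rows belonging to a kept class
    return X[mask], y[mask]

# Toy data (hypothetical): label 2 occurs only once.
y_demo = np.array([0] * 7 + [1] * 6 + [2])
X_demo = np.arange(len(y_demo) * 2, dtype=float).reshape(-1, 2)

X_kept, y_kept = drop_rare_classes(X_demo, y_demo)
print(X_kept.shape, np.unique(y_kept))
```

For one-hot labels such as Y_f, the integer labels could be obtained first with Y_f.argmax(axis=1). Lowering k_neighbors in SMOTE(k_neighbors=...) is another option for classes with a handful of rows, but it cannot help a class that has only a single row.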

Pedro Martins · Answer 2 · 2019-04-10T18:29:34.637

I had a similar problem today. It was fixed when I increased the number of rows in my dataset: I first tried a subsample of n_rows = 1000, and when I changed to n_rows = 5000 I no longer got the error.

Since your dataset is very large, you may find it useful to reduce its size before applying imblearn. In fact, you can find a couple of experiments on the web demonstrating that there is a dataset-size threshold beyond which the classifier's performance does not improve significantly. Here is one of these experiments.
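One caveat when shrinking the data: a plain random subsample can leave rare classes with too few rows for SMOTE, which may be why a 1000-row sample failed for me while 5000 rows worked. A possible workaround, sketched here under that assumption, is to cap only the large classes and keep every row of the small ones. The helper cap_rows_per_class and the toy data are hypothetical.

```python
import numpy as np

def cap_rows_per_class(X, y, max_per_class, seed=0):
    """Randomly keep at most max_per_class rows of each class; small classes stay whole."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    kept = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)  # row indices of this class
        if len(idx) > max_per_class:
            idx = rng.choice(idx, size=max_per_class, replace=False)
        kept.append(idx)
    kept = np.sort(np.concatenate(kept))  # preserve the original row order
    return X[kept], y[kept]

# Toy data (hypothetical): one dominant class and two small ones.
y_demo = np.array([0] * 1000 + [1] * 40 + [2] * 8)
X_demo = np.arange(len(y_demo) * 3, dtype=float).reshape(-1, 3)

X_small, y_small = cap_rows_per_class(X_demo, y_demo, max_per_class=100)
print(np.bincount(y_small))
```

The reduced arrays can then be passed to the oversampler; the cap itself is a tuning choice, not a recommended value.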