I am designing a multi-class classifier for 11 labels. I am using SMOTE
to tackle the class imbalance. However, I get the following error:
Error at SMOTE
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_sample(X_f, Y_f)
error
~/.local/lib/python3.6/site-packages/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
414 "Expected n_neighbors <= n_samples, "
415 " but n_samples = %d, n_neighbors = %d" %
--> 416 (train_size, n_neighbors)
417 )
418 n_samples, _ = X.shape
ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6
Why does it say that n_samples is only 1?
When I tried the same code on a much smaller dataset of 100k rows (and only 4 labels), it ran just fine.
details about input
input parameters
X_f
array([[1.43347000e+05, 1.00000000e+00, 2.03869492e+03, ...,
1.00000000e+00, 1.00000000e+00, 1.35233019e+03],
[5.09050000e+04, 0.00000000e+00, 0.00000000e+00, ...,
5.09050000e+04, 0.00000000e+00, 5.09050000e+04],
[1.43899000e+05, 2.00000000e+00, 2.11447368e+03, ...,
1.00000000e+00, 2.00000000e+00, 1.39707767e+03],
...,
[8.50000000e+01, 0.00000000e+00, 0.00000000e+00, ...,
8.50000000e+01, 0.00000000e+00, 8.50000000e+01],
[2.33000000e+02, 4.00000000e+00, 4.90000000e+01, ...,
4.00000000e+00, 4.00000000e+00, 7.76666667e+01],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
Y_f
array([[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 1.]])
dimensions of input parameters
print(X_f.shape, Y_f.shape)
(2087620, 31) (2087620, 11)
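Since Y_f is one-hot encoded, the number of rows per label is not visible from the shapes alone. Below is a minimal sketch of how the per-class counts could be checked against SMOTE's neighbour requirement. The helper names `class_counts` and `classes_too_small` and the toy labels are hypothetical and only for illustration. The sketch assumes SMOTE's default of k_neighbors=5, which would correspond to the n_neighbors = 6 in the error message because each point is counted among its own neighbours.

```python
from collections import Counter


def class_counts(one_hot_rows):
    """Count samples per class index from one-hot encoded label rows."""
    counts = Counter()
    for row in one_hot_rows:
        # For a one-hot row, the index of the largest entry is the class label.
        label = max(range(len(row)), key=lambda i: row[i])
        counts[label] += 1
    return counts


def classes_too_small(counts, k_neighbors=5):
    """Return the classes that have fewer than k_neighbors + 1 samples."""
    needed = k_neighbors + 1
    return sorted(label for label, n in counts.items() if n < needed)


# Toy labels (hypothetical data, not the real Y_f): class 2 has a single sample.
toy_Y = [[1, 0, 0]] * 8 + [[0, 1, 0]] * 6 + [[0, 0, 1]]
toy_counts = class_counts(toy_Y)
too_small = classes_too_small(toy_counts)

print(dict(toy_counts))  # samples per class
print(too_small)         # classes that cannot supply 6 neighbours
```

On the real arrays the same counts can probably be obtained much faster with `np.unique(Y_f.argmax(axis=1), return_counts=True)`. Any class that shows up with fewer than 6 rows would be a candidate cause of the error.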
my attempts to use other techniques of the imblearn package
Debugging the SMOTE fit_resample() method
I know SMOTE works by synthesizing minority samples, using the Euclidean distance between a minority data point and its nearest neighbours. So I printed out the n_samples variable in the ../python3.6/site-packages/sklearn/neighbors/base.py file. It showed a steadily decreasing number of samples, 5236 -> 103 -> 3, and then I got the error. I could not understand what was going on.
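To make the neighbour requirement concrete, here is a simplified sketch of the interpolation step described above. It is not imblearn's implementation; the function `smote_like_sample` and the toy points are hypothetical. It only illustrates why a class needs at least k_neighbors + 1 samples: one base point plus k_neighbors other points to interpolate towards.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote_like_sample(minority, k_neighbors=5, rng=None):
    """Create one synthetic point between a minority point and a near neighbour."""
    rng = rng or random.Random(0)
    needed = k_neighbors + 1
    if len(minority) < needed:
        raise ValueError(
            "Expected at least %d minority samples, got %d" % (needed, len(minority))
        )
    base_idx = rng.randrange(len(minority))
    base = minority[base_idx]
    # Rank all other minority points by their distance to the base point.
    others = [p for i, p in enumerate(minority) if i != base_idx]
    others.sort(key=lambda p: euclidean(base, p))
    neighbour = rng.choice(others[:k_neighbors])
    gap = rng.random()  # position along the segment from base to neighbour
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Six toy minority points (hypothetical): enough for k_neighbors=5.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
synthetic = smote_like_sample(toy_minority, rng=random.Random(42))
print(synthetic)

# A class with a single sample cannot provide any neighbours.
try:
    smote_like_sample([[0.0, 0.0]])
    single_sample_error = None
except ValueError as exc:
    single_sample_error = str(exc)
print(single_sample_error)
```

In this sketch, a class with only one row fails in the same spirit as the ValueError above, which suggests that the n_samples values printed during debugging may be the sizes of individual classes rather than of the whole dataset.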
- Using SVMSMOTE: takes too long to compute (over 2 days), and the PC crashes.
- Using RandomOverSampler: the model gives poor accuracy, of 45%.
- Using a different sampling_strategy: works for minority only.
- Also tried the suggestions provided here and here, unsuccessfully. I could not understand them, honestly.
- The same error was received when I reduced the dataset to 100k, 1k and 5k rows.
Despite trying, I do not understand much of it. I am a newbie at sampling. Can you help me fix this problem, please?