16

I have already pre-cleaned the data; the first five rows are shown below:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

I have called train_test_split() as follows:

     [IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
   [Note*] `X_train` and `y_train` are now `pandas.core.series.Series` of shape (1785,) and `X_test` and `y_test` are also `pandas.core.series.Series`, of shape (595,)

I have then vectorized the X training and testing data using the following TfidfVectorizer and fit/transform procedures:

     [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
          X_train = v.fit_transform(X_train)
          X_test = v.transform(X_test)

I'm now at the stage where I would normally apply a classifier (if this were a balanced dataset). Instead, I first initialize imblearn's SMOTE() class to perform over-sampling...

     [IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
          smote_model = smote_pipeline.fit(X_train, y_train)
          smote_prediction = smote_model.predict(X_test)

... but this results in:

     [OUT] ValueError: Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6

I've attempted to reduce n_neighbors, but to no avail; any tips or advice would be much appreciated. Thanks for reading.

------------------------------------------------------------------------------------------------------------------------------------

EDIT:

Full Traceback

The dataset/dataframe (df) contains 2380 rows across two columns, as shown in df.head() above. X_train contains 1785 of these rows as strings (df['cleaned']), and y_train contains the corresponding 1785 year labels, also as strings (df['Year']).

Post-vectorization using TfidfVectorizer(): X_train and X_test are converted from pandas.core.series.Series of shape '(1785,)' and '(595,)' respectively, to scipy.sparse.csr.csr_matrix of shape '(1785, 126459)' and '(595, 126459)' respectively.

As for the number of classes: using Counter(), I've calculated that there are 199 classes (Years). Each class label is attached to one element of the aforementioned df['cleaned'] data, i.e. to one string extracted from a textual corpus.

The objective of this process is to automatically determine/guess the year, decade or century (any degree of classification will do!) of input textual data based on the vocabulary present.
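Since any granularity (year, decade, century) would do, one way to get a coarser target is to bin the labels before splitting. A minimal sketch, assuming string or integer year labels (the helper name `to_decade` is mine, not from the code above):

```python
def to_decade(year):
    """Map a year label (int or numeric string) to its decade,
    e.g. 1909 -> 1900. Shrinks ~199 year classes to ~20 decade
    classes, raising the per-class sample count before SMOTE."""
    return (int(year) // 10) * 10

print(to_decade(1909))    # -> 1900
print(to_decade('1915'))  # -> 1910
```

With pandas this would be applied as something like `df['Year'].map(to_decade)` before calling train_test_split.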

sophros
Dbercules

5 Answers

26

Since there are approximately 200 classes and 1800 samples in the training set, you have on average 9 samples per class. The error occurs because a) the data are probably not perfectly balanced, so some classes have fewer than 6 samples, and b) SMOTE's default k_neighbors is 5, so its internal nearest-neighbours search needs 6 samples (the sample itself plus 5 neighbours). A few solutions for your problem:

  1. Calculate the minimum number of samples (n_samples) among the 199 classes and set the k_neighbors parameter of the SMOTE class to a value less than or equal to n_samples - 1.

  2. Exclude the classes with too few samples from oversampling, using the ratio parameter (renamed sampling_strategy in newer imbalanced-learn releases) of the SMOTE class.

  3. Use the RandomOverSampler class, which duplicates existing samples rather than interpolating between neighbours and therefore has no such restriction.

  4. Combine solutions 2 and 3: create a pipeline that applies SMOTE to the classes satisfying the condition n_neighbors <= n_samples and falls back to random oversampling for the classes that do not.
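Solution 1 boils down to choosing a k that the smallest class can support. A stdlib-only sketch (the helper name `safe_k_neighbors` is mine, not part of imbalanced-learn):

```python
from collections import Counter

def safe_k_neighbors(y, default=5):
    """Pick a k_neighbors value SMOTE can use: it must not exceed
    n_minority_samples - 1, since a point is not its own neighbour.
    Classes with a single sample still need removing beforehand."""
    min_count = min(Counter(y).values())
    return max(1, min(default, min_count - 1))

labels = ['1909'] * 6 + ['1910'] * 3 + ['1911'] * 9
k = safe_k_neighbors(labels)
print(k)  # smallest class has 3 samples -> k = 2
```

The resulting value would then be passed as `SMOTE(k_neighbors=k)` when building the pipeline.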

  • Thanks for the answer, what about reducing the number of classes to e.g. decades (for prediction alone) instead of individual years? I'll crack on with your suggestions in the meantime! – Dbercules Mar 24 '18 at 15:49
  • I was unable to investigate (1) and (2) as some classes only possess one sample. However, I was able to successfully pipeline RandomOverSampler (and/or a FakeSampler Class), followed by SMOTE and the Classifier as shown: `make_pipeline(sampler, SMOTE(), clf)`. I'll proceed with this and see what I can do with it! Thanks for your time! – Dbercules Mar 26 '18 at 21:28
  • 2
    @Dbercules: hi, can you please guide me, how did you do make the pipeline? I tried `sm = SMOTE(random_state=42)` `rm = RandomOverSampler(random_state=42)` `my_pipe = make_pipeline(sm, rm)` `X_res, Y_res = my_pipe.fit_resample(X, y)` But got the error, same as the title question – cappy0704 Apr 15 '19 at 17:30
5

Try the following when constructing SMOTE:

oversampler = SMOTE(kind='regular', k_neighbors=2)

This worked for me. (Note that the `kind` parameter has since been removed from imbalanced-learn; in recent versions simply use `SMOTE(k_neighbors=2)`.)

ysf
Remi_TRish
3

WHY IT OCCURS:

In my case it occurred because some categories had as few as one sample. Since SMOTE is based on the k-NN concept, it cannot be applied to a class with only one sample.

HOW I SOLVED IT:

Since those single-sample categories were effectively outliers, I removed them from the dataset, then applied SMOTE, and it worked.

Decreasing the k_neighbors parameter can also help:

xr, yr = SMOTE(k_neighbors=3).fit_resample(x, y)
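Filtering out the rare classes before resampling can be sketched with the standard library alone (the helper `drop_rare_classes` is illustrative, not from this answer):

```python
from collections import Counter

def drop_rare_classes(X, y, min_count=2):
    """Keep only samples whose class appears at least min_count
    times, so SMOTE's k-NN step has neighbours to work with."""
    counts = Counter(y)
    keep = [i for i, label in enumerate(y) if counts[label] >= min_count]
    return [X[i] for i in keep], [y[i] for i in keep]

X = ['a', 'b', 'c', 'd']
y = [1, 1, 1, 2]
Xf, yf = drop_rare_classes(X, y, min_count=2)
print(yf)  # -> [1, 1, 1]  (the single-sample class 2 is dropped)
```

With a DataFrame the same filter could be expressed as a groupby/value_counts mask instead of Python lists.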
Hissaan Ali
0

I think it's possible to use the code below, which sets an explicit per-class target count via the ratio parameter (renamed sampling_strategy in newer imbalanced-learn versions):

sampler = SMOTE(ratio={1: 1927, 0: 300}, random_state=0)

KARMA
0

I was able to solve this issue by following number 1 of this answer.

from collections import Counter

Counter(y)  # inspect the number of samples per class

# drop the classes with a count of 1, because that is lower than
# k_neighbors, whose minimum usable value is 2 in my case

X_res, y_res = SMOTE(k_neighbors=2).fit_resample(X, y)
Shonubi Korede