
I'm using the SMOTE function to oversample my sparse dataset, which contains around 98% 0s and 2% 1s. I used the following code:

from imblearn.over_sampling import SMOTE
import os
import pandas as pd
df_input= pd.read_csv('input_tr.csv',index_col=0) 
train_X=df_input.ix[:, df_input.columns != 'row_num']
df_output=pd.read_csv("output_tr.csv",index_col=0)
train_y=df_output
sm = SMOTE(random_state=12, ratio = 1.0)
train_X_sm,train_y_sm=sm.fit_sample(train_X,train_y)

I'm getting the following error:

line 347, in kneighbors
(train_size, n_neighbors)
ValueError: Expected n_neighbors <= n_samples,  but n_samples = 4, n_neighbors = 6

Can you please help me solve this error?

Python Learner

2 Answers


I had a similar issue.

SMOTE is based on a k-nearest-neighbours (KNN) algorithm, so a class needs a minimum number of samples before new synthetic instances of it can be created.

For example:

  • Suppose you are trying to predict an integer class (1, 2 or 3) and you have just 2 samples of class 1. How would you find k = 3 neighbours among them? It is impossible: the data is too imbalanced.

The message is pretty clear:

Expected n_neighbors <= n_samples.

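So you need at least as many samples as neighbours in order to create new instances.

I looked at your dataset and you have just 4 samples of output 1. The message is saying that only 4 samples are available, but SMOTE needs 6 neighbours to create a new instance from them. The 6 comes from the default k_neighbors=5 plus the sample itself, which the neighbour search also counts.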
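
To check this before calling SMOTE, something like the sketch below could help. It only counts labels and flags classes with fewer than k_neighbors + 1 samples. The function name and the label list are made up for illustration; the list just mimics the 98% / 2% split from the question with 4 samples of class 1.

```python
from collections import Counter

def classes_too_small_for_smote(y, k_neighbors=5):
    """Return {label: count} for classes with fewer than k_neighbors + 1 samples.

    Hypothetical helper, not part of imbalanced-learn.
    """
    # The neighbour search asks for k_neighbors + 1 points:
    # the sample itself plus its k_neighbors neighbours.
    needed = k_neighbors + 1
    return {label: n for label, n in Counter(y).items() if n < needed}

# Made-up labels: 196 zeros and 4 ones (2% ones).
y = [0] * 196 + [1] * 4
print(classes_too_small_for_smote(y))  # {1: 4}
```

With the default k_neighbors=5, class 1 is flagged because it has 4 samples and 6 are required. Note that y should be a 1-D sequence of labels (for a pandas DataFrame, pass the label column rather than the whole frame).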

Andre Araujo

It's basically the issue of having an imbalanced dataset in which a class has too few samples for KNN to work.

If, for instance, you had a class with only one instance, you would not be able to run SMOTE and would get the error below: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 2

This basically means that the single instance doesn't have any neighbours to use for the oversampling.

  • One solution would be to use another technique instead, such as random oversampling (RandomOverSampler in imblearn), which has no issues even with one instance per class.
  • Or, alternatively, remove the classes that have too few instances.
  • Or, alternatively, specify a value for k smaller than the default, so that it stays below the number of samples in the smallest class:

    k = 1  # number of neighbours
    sm = SMOTE(k_neighbors=k, random_state=seed)
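
The first and third options can be combined. The sketch below is one possible way to do it, not an official recipe: it picks the largest k that the smallest class allows, and falls back to random oversampling when a class has a single sample. Both function names are hypothetical. It assumes a recent imbalanced-learn release, where the resampling method is called fit_resample rather than the older fit_sample used in the question.

```python
from collections import Counter

def plan_oversampling(y, default_k=5):
    """Decide which oversampler to use for the 1-D label sequence y.

    Returns ("smote", k) or ("random", None). Hypothetical helper.
    """
    smallest = min(Counter(y).values())
    if smallest < 2:
        # A single sample has no neighbours, so SMOTE cannot interpolate.
        return ("random", None)
    # SMOTE needs k_neighbors + 1 samples in every class it oversamples.
    return ("smote", min(default_k, smallest - 1))

def build_sampler(y, seed=12):
    """Build the imbalanced-learn sampler chosen by plan_oversampling."""
    # Imported here so the planning step works without imblearn installed.
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    kind, k = plan_oversampling(y)
    if kind == "smote":
        return SMOTE(k_neighbors=k, random_state=seed)
    return RandomOverSampler(random_state=seed)

# Made-up labels matching the question: 4 samples of class 1.
print(plan_oversampling([0] * 196 + [1] * 4))  # ('smote', 3)

# Usage, assuming train_X and train_y hold the features and 1-D labels:
# train_X_sm, train_y_sm = build_sampler(train_y).fit_resample(train_X, train_y)
```

With only 4 minority samples, SMOTE with k_neighbors=3 will run, but the synthetic points are interpolated between very few originals, so it is worth checking whether the result actually helps the model.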