5

I am trying to use SMOTE from the imblearn package in Python, but my data has a lot of missing values and I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I checked the parameters here, and none of them seems to deal with missing values.

Is there a way to generate synthetic samples with missing values?

sophros
MJeremy

2 Answers

4

SMOTE does not fill in missing or NaN values. You need to impute them first and then pass the data to SMOTE. Handling missing values is a separate task altogether; as a starting point, take a look at SimpleImputer from sklearn.impute, which replaced the older Imputer class from sklearn.preprocessing. The scikit-learn documentation also has a write-up on the topic: Imputing missing values

Once the NaN values have been dealt with, feed the modified data to SMOTE.
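As a rough illustration of the imputation step on its own, here is a minimal sketch. It assumes a recent scikit-learn where SimpleImputer lives in sklearn.impute; the toy array and the choice of mean imputation are made up for illustration and are not from the question's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (hypothetical data, for illustration only)
X = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 6.0],
    [5.0, 8.0],
])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# SMOTE raises a ValueError on NaN input, so check that none remain
has_nan = np.isnan(X_imputed).any()
print("NaNs remaining:", has_nan)
print(X_imputed)
```

Mean imputation is only one option; whether it is appropriate depends on why the values are missing. The resulting X_imputed is what would then be passed to SMOTE together with the labels.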


Gambit1614
  • 1
    XGBoost and LightGBM can fit data with missing values, so I thought it might be possible to generate synthetic data even when values are missing. Maybe not with SMOTE, but intuitively I thought there might be some way. Thanks for your answer! – MJeremy Jul 13 '18 at 12:55
-1

A simple example is the following:

# Imports
from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target

# Initial number of samples per class
print('Number of samples for both classes: {} and {}.'.format(*Counter(y).values()))

# SMOTEd class distribution (no missing values yet)
print('Dataset has %s missing values.' % np.isnan(X).sum())
# fit_sample was renamed to fit_resample in newer imblearn releases
_, y_resampled = SMOTE().fit_resample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))

# Generate artificial missing values
# Note: columns whose values all exceed 1.0 become entirely NaN;
# SimpleImputer drops such columns by default.
X[X > 1.0] = np.nan
print('Dataset has %s missing values.' % np.isnan(X).sum())
# Impute first, then oversample: SMOTE itself never sees a NaN
_, y_resampled = make_pipeline(SimpleImputer(), SMOTE()).fit_resample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))
  • Looks like SMOTE cannot handle NaNs. ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). – Anuj Gupta Aug 12 '19 at 06:08