SMOTE with missing values

Question

I am trying to use SMOTE from the imblearn package in Python, but my data has a lot of missing values and I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I checked the parameters here, and it seems that none of them deals with missing values.

Is there a way to generate synthetic samples with missing values?

Gambit1614 · Answer 1 · 2018-07-13T12:37:54.987

4

SMOTE does not fill in missing or NaN values for you. You need to impute them first and then pass the data to SMOTE. Handling missing values is a separate task altogether; you can take a look at Imputer from sklearn to begin with (from scikit-learn 0.20 onwards this is SimpleImputer in sklearn.impute, and the old Imputer class was removed in 0.22). Here is another write-up from sklearn on missing values: Imputing missing values

Once you have finished dealing with the NaN values, feed your modified data to SMOTE.
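To illustrate why the NaNs have to go first, here is a minimal NumPy-only sketch (it does not call imblearn, and the names x_i, x_nn and lam are made up for the illustration). SMOTE builds each synthetic point by interpolating between a minority sample and one of its nearest neighbours, so a single NaN poisons both the neighbour search and the new sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class samples; the neighbour has a missing feature.
x_i = np.array([1.0, 2.0, 3.0])
x_nn = np.array([2.0, np.nan, 5.0])

# SMOTE-style interpolation: x_new = x_i + lam * (x_nn - x_i), with lam in [0, 1).
lam = rng.random()
x_new = x_i + lam * (x_nn - x_i)

# The Euclidean distance needed for the neighbour search is undefined as well.
distance = np.sqrt(np.sum((x_i - x_nn) ** 2))

print(x_new)     # the second feature of the synthetic point is NaN
print(distance)  # nan
```

That is why the library validates its input and raises the ValueError up front instead of silently producing NaN-filled samples.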


1

XGBoost and LightGBM can fit data with missing values, so I thought it might be possible to generate synthetic data even when values are missing. Maybe not with SMOTE, but intuitively I thought there might be some way. Thanks for your answer! – MJeremy Jul 13 '18 at 12:55

score -1 · Answer 2 · 2018-07-14T15:24:57.773

A simple example is the following:

# Imports
# Note: Imputer was removed in scikit-learn 0.22 (use SimpleImputer) and
# fit_sample was replaced by fit_resample in imbalanced-learn.
from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target

# Initial number of samples per class
print('Number of samples for both classes: {} and {}.'.format(*Counter(y).values()))

# SMOTEd class distribution (no missing values yet, so plain SMOTE works)
print('Dataset has %s missing values.' % np.isnan(X).sum())
_, y_resampled = SMOTE().fit_resample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))

# Generate artificial missing values
X[X > 1.0] = np.nan
print('Dataset has %s missing values.' % np.isnan(X).sum())

# Impute first, then oversample: the pipeline fills the NaNs before SMOTE sees the data
_, y_resampled = make_pipeline(SimpleImputer(), SMOTE()).fit_resample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))

Looks like SMOTE cannot handle NaNs. ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). — Anuj Gupta, Aug 12 '19 at 06:08
