How to get SVMs to play nicely with missing data in scikit-learn?

Question

I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by NA). I load the data with genfromtxt using dtype='f8' and then train my classifier.

The classification is fine on RandomForestClassifier and GradientBoostingClassifier objects, but using SVC from sklearn.svm causes the following error:

    probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
    X = self._validate_for_predict(X)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
    X = atleast2d_or_csr(X, dtype=np.float64, order="C")
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
    assert_all_finite(X)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
    raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity

What gives? How can I make the SVM play nicely with the missing data, keeping in mind that the same data works fine with random forests and other classifiers?

score 26 · Accepted Answer · edited Oct 16 '13 at 13:10


You can impute the missing values before fitting the SVM.

EDIT: In scikit-learn, there's a really easy way to do this, illustrated on this page.

(copied from page and modified)

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> # missing_values is the placeholder used for missing entries
>>> # strategy is 'mean', 'median' or 'most_frequent' (the mode)
>>> # axis=0 imputes along columns: each missing value is filled from the other samples' values of that feature
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit(train)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> train_imp = imp.transform(train)

edited Oct 16 '13 at 13:10

ogrisel


answered Jul 12 '12 at 15:34

Wei


What about Infinite values? This indicates a strategy only with NaN (i.e. division by zero) – lefterav Oct 10 '14 at 17:11
I did this but the transformation changed the data to a non integer array. If I don't impute the svm classification works fine, but when I impute the data I get the error `IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices`. Any tips ? – Dhanush Gopinath Jun 14 '17 at 11:41
This answer seems to be outdated. I always end up with: "ImportError: cannot import name 'Imputer' from 'sklearn.preprocessing'". – Hagbard May 31 '21 at 11:02
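
On the question of infinite values raised in the comments above: the imputers shown in this thread only replace the placeholder passed as missing_values (NaN here), so infinities are not treated as missing. One possible approach, sketched below with hypothetical toy data, is to mark every non-finite entry as NaN first, so that a NaN-based imputer can then fill it like any other missing value.

```python
import numpy as np

# Hypothetical toy data containing NaN as well as +/- infinity.
X = np.array([[1.0, np.inf],
              [np.nan, 4.0],
              [-np.inf, 6.0]])

# Keep finite entries as they are and turn everything else
# (NaN, +inf, -inf) into NaN.
X_marked = np.where(np.isfinite(X), X, np.nan)
```

Whether an infinity should really count as "missing" depends on where it came from, so this is a judgment call rather than a general rule.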

score 6 · Answer 2 · edited Mar 05 '14 at 18:07


You can either remove the samples with missing features or replace the missing features with their column-wise medians or means.
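
A minimal sketch of both options using plain NumPy. The array names and the toy data are hypothetical, and NaN is assumed to be the missing-value marker:

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features, NaN marks a missing value.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])
y = np.array([0, 1, 0, 1])

# Option 1: drop every sample (row) that has at least one missing feature.
complete_rows = ~np.isnan(X).any(axis=1)
X_dropped = X[complete_rows]
y_dropped = y[complete_rows]

# Option 2: replace each missing value with the median of its column,
# computed while ignoring the NaNs.
col_medians = np.nanmedian(X, axis=0)
X_filled = np.where(np.isnan(X), col_medians, X)
```

Dropping rows is simplest but discards data; filling keeps every sample at the cost of introducing estimated values.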

edited Mar 05 '14 at 18:07

Gyan Veda


answered Jul 12 '12 at 08:17

ogrisel


score 2 · Answer 3 · answered May 31 '21 at 11:31

The most popular answer here is outdated: "Imputer" has been replaced by "SimpleImputer", which lives in sklearn.impute rather than sklearn.preprocessing. Imputing the training and testing data worked for me as follows:

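import numpy as np
from sklearn import svm
from sklearn.impute import SimpleImputer

# x_train, y_train and x_test are assumed to be defined already

# learn the per-column means from the training data only
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(x_train)

# fill the missing values in both sets with the training means
X_train_imp = imp.transform(x_train)
X_test_imp = imp.transform(x_test)

clf = svm.SVC()
clf = clf.fit(X_train_imp, y_train)
predictions = clf.predict(X_test_imp)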
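
Since the question fits and calls predict_proba inside a cross-validation loop, one way to combine the two steps is to wrap the imputer and the SVM in a Pipeline, so that the imputation statistics are learned from the training fold only. This is a sketch under assumptions, not the asker's actual setup: the toy data, the 4-fold split and all variable names are hypothetical. Note that SVC needs probability=True for predict_proba to be available.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 40 samples, 3 features, two balanced classes,
# with roughly 10% of the entries replaced by NaN.
rng = np.random.RandomState(0)
y = np.arange(40) % 2
X = rng.normal(size=(40, 3)) + y[:, None]
X[rng.rand(40, 3) < 0.1] = np.nan

# The imputer is fitted together with the SVM on each training fold.
model = make_pipeline(SimpleImputer(strategy="mean"), SVC(probability=True))

cv = StratifiedKFold(n_splits=4)
for train_idx, test_idx in cv.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    # Missing values in the test fold are filled with the training-fold means.
    probas = model.predict_proba(X[test_idx])
```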
