27

I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by NA). I load the data in with genfromtxt with dtype='f8' and go about training my classifier.

The classification is fine on RandomForestClassifier and GradientBoostingClassifier objects, but using SVC from sklearn.svm causes the following error:

    probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
    X = self._validate_for_predict(X)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
    X = atleast2d_or_csr(X, dtype=np.float64, order="C")
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
    assert_all_finite(X)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
    raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity

What gives? How can I make the SVM work with the missing data, given that the same data works fine with random forests and other classifiers?

Jim

3 Answers

26

You can do data imputation to handle missing values before using SVM.

EDIT: In scikit-learn, there's a really easy way to do this, illustrated on this page.

(copied from page and modified)

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> # missing_values is your placeholder value; strategy is 'mean', 'median' or 'most_frequent'; axis=0 imputes along columns, so each gap is filled from the other samples' values for that feature
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit(train)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> train_imp = imp.transform(train)
ogrisel
Wei
  • What about infinite values? This only shows a strategy for NaN (i.e. division by zero) – lefterav Oct 10 '14 at 17:11
  • I did this, but the transformation changed the data to a non-integer array. If I don't impute, the SVM classification works fine, but when I impute the data I get the error `IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices`. Any tips? – Dhanush Gopinath Jun 14 '17 at 11:41
  • This answer seems to be outdated. I always end up with: "ImportError: cannot import name 'Imputer' from 'sklearn.preprocessing'". – Hagbard May 31 '21 at 11:02
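On the infinite values raised in the comments: as far as I know, SimpleImputer only fills the value given as missing_values and rejects infinite input, so one option is to turn infinities into NaN first. A minimal sketch, assuming the current sklearn.impute API; the toy array and the names X_clean and X_imp are made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration) with one infinity and one NaN.
X = np.array([[1.0, 2.0],
              [np.inf, 4.0],
              [3.0, np.nan]])

# Treat +/-inf as missing so that both cases are imputed the same way.
X_clean = np.where(np.isfinite(X), X, np.nan)

# Fill every gap with the mean of the remaining values in its column.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imp = imp.fit_transform(X_clean)
```

Whether an infinity should really count as "missing" depends on where it came from, so this is a judgment call rather than a general rule.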
6

You can either remove the samples with missing features or replace the missing features with their column-wise medians or means.
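Both options can be sketched as follows. This assumes a recent scikit-learn (sklearn.impute.SimpleImputer); the toy arrays and variable names are made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): 4 samples, 2 features, two gaps.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0]])
y = np.array([0, 1, 0, 1])

# Option 1: drop every sample that has at least one missing feature.
# The same mask must be applied to the labels to keep them aligned.
complete = ~np.isnan(X).any(axis=1)
X_dropped, y_dropped = X[complete], y[complete]

# Option 2: replace each gap with the median of its column.
X_median = SimpleImputer(strategy="median").fit_transform(X)
```

Dropping rows is simplest but throws data away (half of the toy samples here); imputing keeps every sample at the cost of inventing values.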

Gyan Veda
ogrisel
2

The most popular answer here is outdated. "Imputer" is now "SimpleImputer". The current way to solve this issue is given here. Imputing the training and testing data worked for me as follows:

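from sklearn import svm
import numpy as np
from sklearn.impute import SimpleImputer

# x_train, x_test and y_train are assumed to be existing arrays;
# x_train and x_test may contain np.nan.

# Learn the per-column means from the training data only.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(x_train)

# Fill the gaps in both sets with the training means.
X_train_imp = imp.transform(x_train)
X_test_imp = imp.transform(x_test)

# Train and predict on the imputed data.
clf = svm.SVC()
clf = clf.fit(X_train_imp, y_train)
predictions = clf.predict(X_test_imp)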
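Since the question calls predict_proba inside a cross-validation loop, it may be convenient to bundle the imputer and the SVM in a pipeline so the imputer is refit on the training part of each fold. A minimal sketch, assuming a recent scikit-learn; the synthetic data and the names model and probas are made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic data (made up for illustration): 60 samples, 3 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.rand(60, 3) < 0.1] = np.nan  # knock out roughly 10% of the entries

# Imputation and classification as one estimator; probability=True is
# needed for SVC to expose predict_proba.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    SVC(probability=True, random_state=0),
)

# Out-of-fold class probabilities, one row per sample.
probas = cross_val_predict(model, X, y, cv=3, method="predict_proba")
```

Fitting the imputer inside each fold keeps the test fold's values out of the column means, which a single imp.fit on the full data would not.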
Hagbard