
I have a dataset that works nicely in Weka. It has a lot of missing values, represented by '?'. Using a decision tree, I am able to deal with the missing values.

However, in scikit-learn, I see that the estimators can't be used with data that has missing values. Is there an alternative library I can use instead that supports this?

Otherwise, is there a way to get around this in scikit-learn?

mugetsu
  • I don't want to mark your question as a duplicate of this one: http://stackoverflow.com/questions/9365982/missing-values-in-scikits-machine-learning However, hopefully it answers your question. – Anthony Kong Nov 25 '15 at 01:30
  • @AnthonyKong Yeah, I saw that post. But the answers all seem to suggest imputation as the solution, which is what I want to avoid. – mugetsu Nov 25 '15 at 01:39
  • According to the docs, there seems to be no other way: http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values – Anthony Kong Nov 25 '15 at 01:43
  • Some packages in R support this. – dukebody Apr 23 '16 at 18:09

1 Answer


The py-earth package supports missing data. It's still in development and not yet on PyPI, but it's quite usable and well tested at this point, and it interacts well with scikit-learn. Missingness is handled as described in this paper. It does not assume that values are missing at random; in fact, missingness is treated as potentially predictive. The important assumption is that the distribution of missingness in your training data must be the same as in whatever data you use the model with in operation.
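One crude way to sanity-check that last assumption is to compare the fraction of missing entries per column in the training data and in the data seen in operation. The following is only a sketch in plain Python: the helper name `missing_rates` and the toy rows are made up for illustration, it is not part of py-earth, and matching per-column rates does not prove the full missingness distributions match.

```python
import math

def missing_rates(rows):
    """Return the fraction of missing (None or NaN) entries in each column."""
    n_cols = len(rows[0])
    counts = [0] * n_cols
    for row in rows:
        for j, value in enumerate(row):
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[j] += 1
    return [count / len(rows) for count in counts]

nan = float("nan")

# Hypothetical toy data: rows are samples, columns are features.
train = [[1.0, nan], [2.0, 0.5], [nan, 0.7], [4.0, nan]]
new = [[1.5, 0.2], [nan, 0.1], [2.5, nan], [3.0, 0.9]]

train_rates = missing_rates(train)
new_rates = missing_rates(new)

# Largest per-column difference in missing rate between the two data sets.
max_gap = max(abs(a - b) for a, b in zip(train_rates, new_rates))
print(train_rates, new_rates, max_gap)
```

A large gap in any column suggests the model may be used under a different missingness pattern than the one it was trained on.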

The Earth class provided by py-earth is a regressor. To create a classifier, you need to put it in a pipeline with some other scikit-learn classifier (I usually use LogisticRegression for this). Here's an example:

from pyearth import Earth
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# X and y are some training data (numpy arrays, pandas DataFrames, or
# similar) and X may have some values that are missing (nan, None, or 
# some other standard signifier of missingness)
from your_data import X, y

# Create an Earth-based classifier that accepts missing data
earth_classifier = Pipeline([('earth', Earth(allow_missing=True)),
                             ('logistic', LogisticRegression())])

# Fit on the training data
earth_classifier.fit(X, y)

The Earth model handles missingness in a nice way, and the LogisticRegression only sees the transformed data coming out of Earth.transform.
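Since the data in the question marks missing values with '?', those markers need to become nan (or None) before fitting. Here is a minimal sketch using only the standard library, assuming the data is available as comma-separated rows of numeric values; the helper name `parse_rows` and the sample text are hypothetical, and nominal attributes would need their own encoding:

```python
import csv
import io

MISSING_MARKER = "?"  # the marker used for missing values in the question's data

def parse_rows(text):
    """Parse comma-separated numeric rows, turning '?' into NaN."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        rows.append([float("nan") if field.strip() == MISSING_MARKER else float(field)
                     for field in fields])
    return rows

# Hypothetical sample with one missing value per row.
sample = "5.1,?,1.4\n?,3.0,1.3\n4.7,3.2,?\n"
X_rows = parse_rows(sample)
print(X_rows)
```

If you load the data with pandas instead, passing `na_values="?"` to `read_csv` should have the same effect.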

Disclaimer: I am an author of py-earth.

jcrudy