The py-earth package supports missing data. It's still in development and not yet on PyPI, but at this point it's quite usable, well tested, and interacts well with scikit-learn. Missingness is handled as described in this paper. It does not assume that values are missing at random; in fact, missingness is treated as potentially predictive. The important assumption is that the distribution of missingness in your training data must be the same as in whatever data you use the model with in operation.
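One rough way to sanity-check that assumption is to compare per-column missingness rates between the training data and the data seen in operation. The sketch below uses only NumPy; the arrays, the `missing_rates` helper, and the `tolerance` threshold are all made up for illustration, and matching marginal rates is only a necessary condition, not proof that the missingness distributions agree.

```python
import numpy as np

def missing_rates(X):
    """Fraction of NaN entries in each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    return np.isnan(X).mean(axis=0)

# Hypothetical training and operational data (not from any real dataset)
X_train = np.array([[1.0, np.nan],
                    [2.0, 3.0],
                    [np.nan, 4.0],
                    [5.0, 6.0]])
X_live = np.array([[np.nan, np.nan],
                   [np.nan, 1.0],
                   [2.0, np.nan],
                   [np.nan, 3.0]])

train_rates = missing_rates(X_train)
live_rates = missing_rates(X_live)

# Flag columns whose missingness rate moved by more than an arbitrary threshold
tolerance = 0.1
drift = np.abs(train_rates - live_rates)
suspect_columns = np.flatnonzero(drift > tolerance)
print(train_rates, live_rates, suspect_columns)
```

If columns show up in `suspect_columns`, the model may be seeing a missingness pattern it was not trained on, and its predictions for those rows deserve extra scrutiny.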
The Earth class provided by py-earth is a regressor. To create a classifier, you need to put it in a pipeline with some other scikit-learn classifier (I usually use LogisticRegression for this). Here's an example:
from pyearth import Earth
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# X and y are some training data (numpy arrays, pandas DataFrames, or
# similar) and X may have some values that are missing (nan, None, or
# some other standard signifier of missingness)
from your_data import X, y
# Create an Earth-based classifier that accepts missing data
earth_classifier = Pipeline([('earth', Earth(allow_missing=True)),
                             ('logistic', LogisticRegression())])
# Fit on the training data
earth_classifier.fit(X, y)
The Earth model handles the missingness itself, and the LogisticRegression only sees the transformed data coming out of Earth.transform.
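To illustrate the general pattern (a transformer absorbs the missing values, and the final classifier only ever sees its output), here is a small sketch that runs without py-earth. It is not py-earth's algorithm: scikit-learn's `SimpleImputer` with `add_indicator=True` is used purely as a stand-in for the first pipeline step, and the synthetic data is constructed so that the label depends only on whether the value is missing.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data: one feature, roughly 30% missing,
# and the label is 1 exactly when the value is missing
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1))
missing = rng.uniform(size=200) < 0.3
X[missing, 0] = np.nan
y = missing.astype(int)

# Stand-in for the Earth step: impute the mean and append a
# binary "was missing" indicator column
clf = Pipeline([('impute', SimpleImputer(strategy='mean', add_indicator=True)),
                ('logistic', LogisticRegression())])
clf.fit(X, y)

# What the logistic regression actually receives: no NaNs, two columns
transformed = clf.named_steps['impute'].transform(X)
accuracy = clf.score(X, y)

# Prediction on a new row whose only feature is missing
prediction = clf.predict(np.array([[np.nan]]))
print(transformed.shape, accuracy, prediction)
```

Because the indicator column carries the missingness signal, the classifier can use it as a predictor, which is the same idea, in a much cruder form, as treating missingness as potentially predictive.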
Disclaimer: I am an author of py-earth.