
Edited post

This is a short and somewhat clarified version of the original post.

  1. We've got a training dataset (some features are significantly correlated). The feature space has 20 dimensions (all continuous).
  2. We need to train a nonparametric (most features form nonlinear subspaces and we can't assume a distribution for any of them) imputer (kNN or tree-based regression) using the training data.
  3. We need to predict multiple missing values in query data (a query feature-vector can have up to 13 missing features, so the imputer should handle any combination of missing features) using the trained imputer. NOTE: the imputer should not be retrained/fitted on the query data in any way (as is done in all mainstream R packages I've found so far: Amelia, impute, mi and mice...). That is, the imputation should be based solely on the training data; a sketch of the desired workflow follows this list.
  4. The purpose for all this is described below.
  5. A small data sample is down below.
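
To make the requested workflow concrete, here is a minimal sketch of the fit-on-training / impute-query contract from points 2-3. It uses sklearn.impute.KNNImputer (only available in scikit-learn 0.22+) purely to illustrate the interface; X_train and X_query are placeholder arrays standing in for our data.

    import numpy as np
    from sklearn.impute import KNNImputer  # scikit-learn >= 0.22

    # placeholders: a complete training matrix and query rows with arbitrary NaN patterns
    X_train = np.random.rand(1000, 20)
    X_query = np.where(np.random.rand(50, 20) < 0.3, np.nan, np.random.rand(50, 20))

    imputer = KNNImputer(n_neighbors=10)         # NaN-aware Euclidean distance inside
    imputer.fit(X_train)                         # fitted on the training data only
    X_query_filled = imputer.transform(X_query)  # handles any combination of missing features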

Original post (TL;DR)

Simply put, I've got some sophisticated data imputing to do. We've got a training dataset of ~100k 20D samples and a smaller testing dataset. Each feature/dimension is a continuous variable, but the scales are different. There are two distinct classes. Both datasets are very NA-inflated (the NAs are not equally distributed across dimensions). I use sklearn.ensemble.ExtraTreesClassifier for classification and, although tree ensembles can handle cases with missing data, there are three reasons to perform imputation anyway:

  1. This way we get votes from all trees in the forest during classification of a query dataset (not just from the trees that don't rely on a missing feature).
  2. We don't lose data during training.
  3. The scikit-learn implementations of tree ensembles (both ExtraTrees and RandomForest) do not handle missing values. But this point is not that important; if it weren't for the former two, I would've just used rpy2 + some nice R implementation.

Things are quite simple with the training dataset, because I can apply a class-specific median imputation strategy to deal with missing values, and this approach has been working fine so far. Obviously this approach can't be applied to a query - we don't have the classes to begin with. Since we know that the classes will likely have significantly different shares in the query, we can't apply a class-indifferent approach, because that might introduce bias and reduce classification performance; therefore we need to impute missing values from a model.
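
As a minimal sketch of that training-set strategy (assuming the training data sits in a pandas DataFrame with a category column, as in the sample further down):

    import pandas as pd

    def impute_train(df):
        """Class-specific median imputation: fill each feature's NaNs with the
        median of that feature computed within the row's own class."""
        features = df.columns.drop("category")
        filled = df.copy()
        filled[features] = (df.groupby("category")[features]
                              .transform(lambda col: col.fillna(col.median())))
        return filled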

Linear models are not an option for several reasons:

  1. all features are correlated to some extent;
  2. theoretically we can get any combination of missing features in a sample feature-vector; even though our tool requires at least 7 non-missing features, we end up with ~10^6 possible models (see the quick count after this list), which doesn't look very elegant if you ask me.
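
A quick sanity check of that count (20 features, 1 to 13 of them missing, i.e. at least 7 observed):

    from math import comb  # Python 3.8+

    # number of distinct missing-feature patterns = number of per-pattern linear models
    n_patterns = sum(comb(20, k) for k in range(1, 14))
    print(n_patterns)  # 988115, i.e. ~10^6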

Tree-based regression models aren't good for the very same reason. So we ended up picking kNN (k nearest neighbours) - ball tree or LSH with a radius threshold, to be more specific. This approach fits the task quite well, because the dimensions (and hence the distances) are correlated, so we get decent performance in extremely NA-rich cases, but there are several drawbacks:

  1. I haven't found a single implementation in Python (including impute, sklearn.preprocessing.Imputer, orange) that handles feature-vectors with different sets of missing values; that is, we want a single imputer for all possible combinations of missing features.
  2. kNN uses pairwise point distances for prediction/imputation. As I've already mentioned, our variables have different scales, hence the feature space must be normalised prior to distance estimation, and we need to know the theoretic max/min values for each dimension to scale it properly. This is not so much a problem as a matter of architectural simplicity (a user will have to provide a vector of min/max values); a sketch of the whole scheme follows this list.
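
Here is a minimal sketch of the kind of imputer I have in mind, written in plain NumPy under two assumptions: the training matrix has already been completed (e.g. by the class-specific median fill above) and the user supplies theoretic min/max vectors for scaling. The distance is Euclidean over the query's observed features, rescaled by n_features / n_observed so heavily incomplete queries remain comparable.

    import numpy as np

    def minmax_scale(X, mins, maxs):
        """Scale every dimension to [0, 1] with user-supplied theoretic bounds; NaNs stay NaN."""
        return (X - mins) / (maxs - mins)

    def nan_knn_impute(query, X_train, k=10):
        """Fill the NaNs of a single query vector from its k nearest training rows."""
        observed = ~np.isnan(query)                       # features present in the query
        diff = np.where(observed, X_train - query, 0.0)   # ignore missing dimensions
        n_obs = max(observed.sum(), 1)
        dist = np.sqrt(query.size / n_obs * (diff ** 2).sum(axis=1))
        neighbours = np.argsort(dist)[:k]
        filled = query.copy()
        filled[~observed] = X_train[neighbours][:, ~observed].mean(axis=0)
        return filled

    # usage sketch (mins/maxs are the user-provided theoretic bounds per feature):
    # X_scaled = minmax_scale(X_train_imputed, mins, maxs)
    # q_filled = nan_knn_impute(minmax_scale(q, mins, maxs), X_scaled)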

So here is what I would like to hear from you:

  1. Are there any classic ways to address the kNN-related issues given in the list above? I believe this must be a common case, yet I haven't found anything specific on the web.
  2. Is there a better way to impute data in our case? What would you recommend? Please provide implementations in Python (R and C/C++ will be considered as well).

Data

Here is a small sample of the training data set. I reduced the number of features to make it more readable. The query data has identical structure, except for the obvious absence of category information.

v1       v2       v3       v4       v5       category
0.40524  0.71542  NA       0.81033  0.8209   1
0.78421  0.76378  0.84324  0.58814  0.9348   2
0.30055  NA       0.84324  NA       0.60003  1
0.34754  0.25277  0.18861  0.28937  0.41394  1
NA       0.71542  0.10333  0.41448  0.07377  1
0.40019  0.02634  0.20924  NA       0.85404  2
0.56404  0.5481   0.51284  0.39956  0.95957  2
0.07758  0.40959  0.33802  0.27802  0.35396  1
0.91219  0.89865  0.84324  0.81033  0.99243  1
0.91219  NA       NA       0.81033  0.95988  2
0.5463   0.89865  0.84324  0.81033  NA       2
0.00963  0.06737  0.03719  0.08979  0.57746  2
0.59875  0.89865  0.84324  0.50834  0.98906  1
0.72092  NA       0.49118  0.58814  0.77973  2
0.06389  NA       0.22424  0.08979  0.7556   2
Eli Korvigo
  • Missing values not being handled seems to be very much the reason for the question. Can you clarify how using the R implementation (which does imputation of missing values IIRC) would "lose data during training"? – lgautier Sep 06 '15 at 12:41
  • 1. the statement you cite refers to the Python implementation (and a Python solution is preferred); 2. the emphasis is on imputing multiple missing values in testing/query data, hence the fact that the R implementation performs imputing during training is irrelevant. – Eli Korvigo Sep 06 '15 at 20:42
  • Please see [ask] and [How to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I've edited to try to make this a little more digestible, but it's still very long and not clear to someone who is not working in your field, I think – Tensibai Sep 07 '15 at 08:24
  • Can you explain more about the nature of your data and how you want to classify it? Imputation may not be the way to go. Are you open to algorithm suggestions other than kNN? – Aabglov Sep 09 '15 at 20:58
  • Sorry for being so late with the update. Hope the question is clear now. – Eli Korvigo Sep 09 '15 at 21:31

1 Answer


Based on the new update, I think I would recommend against kNN or tree-based algorithms here. Since imputation is the goal, and not a by-product of the method you're choosing, you need an algorithm that will learn to complete incomplete data.

To me this seems very well suited to a denoising autoencoder. If you're familiar with neural networks, it's the same basic principle: instead of training to predict labels, you train the model to predict the input data, with a notable twist.

The 'denoising' part refers to an intermediate step in which you randomly set some percentage of the input data to 0 before attempting to predict it. This forces the algorithm to learn richer features and to complete the data when there are missing pieces. In your case I would recommend a low amount of dropout during training (since your data is already missing features) and no dropout at test time.

It would be difficult to write a helpful example without looking at your data first, but the basics of what an autoencoder does (as well as a complete code implementation) are covered here: http://deeplearning.net/tutorial/dA.html

This link uses a Python module called Theano, which I would HIGHLY recommend for the job. Its flexibility trumps every other module I've looked at for machine learning, and I've looked at a lot. It's not the easiest thing to learn, but if you're going to be doing a lot of this kind of work, I'd say it's worth the effort. If you don't want to go through all that, you can still implement a denoising autoencoder in Python without it.
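
For what it's worth, here is a minimal pure-NumPy sketch of the idea (one hidden layer, tied weights, squared reconstruction loss). It is not the Theano tutorial code, just an illustration of the corrupt-then-reconstruct training loop; the layer size, learning rate and epoch count are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_dae(X, n_hidden=10, drop_rate=0.1, lr=0.5, n_epochs=500):
        """One-hidden-layer denoising autoencoder with tied weights.

        X is assumed to be scaled to [0, 1]; pre-existing NaNs should be set to 0
        so the network treats them like dropped-out inputs and learns to fill them.
        """
        n, d = X.shape
        W = rng.normal(scale=0.1, size=(d, n_hidden))
        b_hid, b_out = np.zeros(n_hidden), np.zeros(d)
        for _ in range(n_epochs):
            X_noisy = X * (rng.random(X.shape) > drop_rate)  # random corruption
            H = sigmoid(X_noisy @ W + b_hid)                 # encode
            X_hat = sigmoid(H @ W.T + b_out)                 # decode
            err = X_hat - X                                  # reconstruct the *clean* input
            d_out = err * X_hat * (1.0 - X_hat)
            d_hid = (d_out @ W) * H * (1.0 - H)
            W -= lr * (X_noisy.T @ d_hid + d_out.T @ H) / n  # tied-weight gradient
            b_hid -= lr * d_hid.mean(axis=0)
            b_out -= lr * d_out.mean(axis=0)
        return W, b_hid, b_out

    def reconstruct(X, W, b_hid, b_out):
        """Run a (possibly incomplete, NaN -> 0) matrix through the trained network."""
        return sigmoid(sigmoid(np.nan_to_num(X) @ W + b_hid) @ W.T + b_out)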

Aabglov
  • Thank you a lot for your answer. I've added a data sample; maybe it will help you give some additional advice. And I used to believe that tree-based regressors learn to replace missing data. – Eli Korvigo Sep 11 '15 at 19:15
  • Thanks for adding a data sample. This confirms my intuition that a denoising autoencoder would be well suited. I'm getting a sense that you're looking for a code sample to reference and the Theano link wasn't sufficient. Are you looking for a more pure-Python example, or are you open to a Theano implementation? – Aabglov Sep 11 '15 at 19:55
  • Glad it helped. If you get stuck with the neural network stuff or general Theano coding, this video is really good: https://www.youtube.com/watch?v=S75EdAcXHKk All the code from it is on GitHub; I reference it all the time: https://github.com/Newmu/Theano-tutorials – Aabglov Sep 12 '15 at 23:18