
I am trying to use scikit-learn to build a model, and I want to know what the best way is to deal with my particular type of missing features.

I have a base of users, who each need to complete a goal within a given time frame (for example 3 days). I have basic information about each user that is constant throughout. I've trained a simple Random Forest Classifier on this information, and it is so far pretty good at predicting whether the user will complete the goal.

I also have a day-by-day breakdown of completion percentage for all users who have already completed (or not completed). Two samples with one user who completed and one who didn't might look something like this for three days: [[0., 0.58, 1.], [0.2, 0.5, .8]] where each feature is the percentage through achieving the goal. The first user got to 100% within the timeframe, the second didn't.

I want to be able to make the predictions for goal completion on the fly. So if there's a new user who's 1 day through the time limit and 20% of the way to the goal, their data might look like this: [[.2, NaN, NaN]]

The only way I can see integrating this data into the existing model is fitting a different model for each day (model for day 1, model for day 2, etc.). But this is not at all feasible for my production environment. I also thought about trying to impute the missing values (for the above, something like .2, .4, .6), but I know for a fact that the user goal completion tends not to be linear like this.

Is there a good way to train a model with this kind of data? Or is there an algorithm, supported by scikit-learn or another Python library, that is built for this kind of task? Note that my model also needs to support probability estimates.

TheRuler

1 Answer


If you have time series data, one way to deal with it efficiently is to break the time series into different parts.
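One possible reading of "breaking the series into parts" is to turn each finished user's history into one training row per day, with the day number as a feature, so that a single model covers every day. The sketch below assumes that interpretation; the helper name expand_prefixes and the [day, progress so far] feature layout are made up for illustration, and the constant per-user features from the question could be appended to each row.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def expand_prefixes(progress, completed):
    """Turn each full history into one row per observed day.

    progress:  shape (n_users, n_days), fraction of the goal reached on each day
    completed: shape (n_users,), 1 if the goal was reached within the time frame
    Returns (X, y), where each row of X is [day_number, progress_so_far].
    """
    progress = np.asarray(progress, dtype=float)
    completed = np.asarray(completed)
    n_users, n_days = progress.shape
    rows, labels = [], []
    for user in range(n_users):
        for day in range(n_days):
            rows.append([day + 1, progress[user, day]])
            labels.append(completed[user])
    return np.array(rows), np.array(labels)

# The two example users from the question (far too little data for a real model).
progress = [[0.0, 0.58, 1.0], [0.2, 0.5, 0.8]]
completed = [1, 0]

X, y = expand_prefixes(progress, completed)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# New user: day 1, 20% of the way to the goal. No NaN placeholders are needed.
proba = model.predict_proba([[1, 0.2]])
```

Note that rows from the final day effectively reveal the label, so in practice it may make sense to drop them or to evaluate each day separately.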

Also, the Random Forest algorithm has a very interesting property: it can handle missing values. For the probability estimates, you can use the predict_proba() method of RandomForestClassifier. For more details, have a look at the scikit-learn RandomForestClassifier documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

enterML
  • I don't believe that Random Forest can explicitly handle missing values. The documentation you linked says nothing about missing values. Where did you get that information? – TheRuler Sep 18 '16 at 22:00
  • Here are two links for that : [link](http://amateurdatascientist.blogspot.in/2012/01/random-forest-algorithm.html) and [link](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1). I hope that helps !! – enterML Sep 19 '16 at 15:28
  • I don't think that the scikit learn Random Forest supports missing values. See [here](https://stackoverflow.com/questions/9365982/missing-values-in-scikits-machine-learning). – TheRuler Sep 19 '16 at 16:52