Dealing with non-compulsory features in scikit-learn Decision Trees

Question

In my dataset, there are some features which are not always present:

HW_GRADE: range 0-100

HW_RESUBMISSION: if present, 0-100

In other words, if the student did not resubmit, then that feature is absent. As far as I can tell, scikit-learn doesn't like NaN or blank features. Using interpolation to force a value into that feature doesn't make sense. I could also create a binary variable 'HW_RESUBMITTED' which would be 0 if HW_RESUBMISSION is NaN. But the actual value, when present, is also a useful discriminator.

The referenced possible duplicate states that missing values are a problem. I agree. In fact, my question is asking for the right way to deal with a scenario where interpolation would lead to the wrong results, and simply setting the missing values to a fixed '0' would also lead to incorrect results. I propose a possible way to handle this and am looking for someone more advanced than me to comment.
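
For concreteness, here is a minimal sketch of the indicator idea described above. It is only an illustration under assumptions: the data is invented, the `PASSED` target column is hypothetical, and the `-1` placeholder is an arbitrary choice of a value outside the valid 0-100 range.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data invented for illustration; NaN means the student did not resubmit.
df = pd.DataFrame({
    "HW_GRADE":        [55, 80, 40, 90, 62, 35],
    "HW_RESUBMISSION": [75, np.nan, 60, np.nan, np.nan, 50],
    "PASSED":          [1, 1, 0, 1, 1, 0],  # hypothetical target, not from the question
})

# Binary indicator: 1 if a resubmission exists, 0 if HW_RESUBMISSION is NaN.
df["HW_RESUBMITTED"] = df["HW_RESUBMISSION"].notna().astype(int)

# Replace NaN with a placeholder outside the valid 0-100 range (an assumption),
# so the real resubmission grades stay untouched and usable for splits.
df["HW_RESUBMISSION"] = df["HW_RESUBMISSION"].fillna(-1)

X = df[["HW_GRADE", "HW_RESUBMISSION", "HW_RESUBMITTED"]]
y = df["PASSED"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
predictions = clf.predict(X)
```

Because a decision tree splits on thresholds, a placeholder below every real grade lets the tree separate "no resubmission" rows with a single split, while the indicator column makes that meaning explicit.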

Possible duplicate of [Missing values in scikits machine learning](https://stackoverflow.com/questions/9365982/missing-values-in-scikits-machine-learning) — piman314, Apr 06 '18 at 15:15
I studied that thread carefully before. My question is very different. — pitosalas, Apr 06 '18 at 15:20
I don't think there is a 'right way' to be honest. Without knowing anything about what you are trying to predict or what percentage of your dataset is missing this value, I would initially impute a median value and add an additional binary feature 'HW_RESUBMITTED_was_nan' (like you suggested). However, given that these aren't just uncorrelated metrics, you could probably do something a little more intelligent like filling NaNs in 'HW_RESUBMISSION' with the value from 'HW_GRADE'. You'd need to experiment a bit. — Stev, Apr 06 '18 at 15:44
@Stev thanks... good insights. I am glad you were able to understand what I was asking :) By the way, is Stack Overflow the best place to ask this kind of question, or is there a more specifically focused forum you know of? — pitosalas, Apr 06 '18 at 18:18
There is [Cross Validated](https://stats.stackexchange.com/), which has a lot of useful comments. — Stev, Apr 09 '18 at 08:07
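
The two suggestions from the comments (median imputation with an indicator column, or falling back to the original grade) might look roughly like the sketch below. The data is made up, and the variable names `X_median` and `X_fallback` are hypothetical; which option works better would need to be checked experimentally.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data invented for illustration; NaN means no resubmission.
df = pd.DataFrame({
    "HW_GRADE":        [55.0, 80.0, 40.0, 90.0],
    "HW_RESUBMISSION": [75.0, np.nan, 60.0, np.nan],
})

# Option 1: fill NaN with the column median and append a 0/1 indicator
# column for every feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_median = imputer.fit_transform(df)
# Resulting columns: HW_GRADE, HW_RESUBMISSION (imputed), was-missing indicator

# Option 2: where there was no resubmission, reuse the original grade,
# and keep an explicit flag so the model can still tell the cases apart.
X_fallback = df.assign(
    HW_RESUBMITTED=df["HW_RESUBMISSION"].notna().astype(int),
    HW_RESUBMISSION=df["HW_RESUBMISSION"].fillna(df["HW_GRADE"]),
)
```

Either result contains no NaN values and can be passed to a scikit-learn decision tree as the feature matrix.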
