
I am using the randomForest package in R to build a binary classifier. There are about 30,000 rows, with 14,000 in the positive class and 16,000 in the negative class. I have 15 variables that are known to be important for classification.

I have about 5 additional variables with incomplete information. These variables take the values 1 or 0: 1 means something is present, while 0 means it is not known whether it is present or absent. It is well established that these variables are highly informative when they are 1 (a 1 increases the reliability of the classification and makes the positive class more likely), but useless when they are 0. Only about 5% of the rows have the value 1, so each variable is useful for only 5% of the cases. The 5 variables are independent of each other, so I expect them to be useful for roughly 15-25% of my data.
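As a sanity check on that 15-25% estimate: with five independent indicators, each equal to 1 in about 5% of rows, the fraction of rows with at least one 1 is 1 - 0.95^5.

```r
# Probability that at least one of 5 independent indicators is 1,
# when each is 1 in about 5% of rows:
p_at_least_one <- 1 - 0.95^5
p_at_least_one                  # ~0.2262, i.e. roughly 23% of rows
round(30000 * p_at_least_one)   # ~6787 of the 30,000 rows
```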

Is there a way to make use of the available data while neglecting the missing/unknown values within a single column? Your ideas and suggestions would be appreciated. The implementation does not have to be specific to random forest or R; if this is possible with other machine learning techniques or on other platforms, those are also most welcome. Thank you for your time. Regards

Abhishek
  • probably :) can you [provide example data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? thanx – Anthony Damico Dec 20 '12 at 10:04
  • If you have only the option present and unknown, you can do two things: see 'unknown' as a useful category, indicating something, or see 'unknown' as missing data. If you do the latter, these variables will contain only a single non-missing value and are hence useless in any kind of analysis. Never mind the fact that the missingness in your data will throw out the majority of the cases you have now. So you can, but it's unlikely to be even a remotely good idea. – Joris Meys Dec 20 '12 at 10:07
  • Thank you for your comment. For the variables with only "presence" and "unknown" values, a 1 tells me the prediction should lean toward a certain class: if the value is 1, the sample is highly likely to fall in the positive class, but if it is 0, the classifier should depend entirely on the other variables. – Abhishek Dec 20 '12 at 10:30

2 Answers


I can see at least the following approaches. Personally, I prefer the third option.

1) Discard the extra columns

You can simply discard those 5 extra columns. Obviously this is not optimal, but it is useful to know how this option performs, as a baseline for the others.

2) Use the data as it is

In this case, those 5 extra columns are left as they are. The definite presence (1) or unknown presence/absence (0) in each of those 5 columns is used as information. This is the same as saying "if I'm not sure whether something is present or absent, I'll treat it as absent". I know this is obvious, but if you haven't tried this, you should, to compare it to option 1.
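A minimal sketch of option 2 in R, on simulated data (all column names and the label rule are invented for illustration): the indicator columns are fed to the forest exactly as recorded, with 0 standing for "absent or unknown". The randomForest call is guarded since that package may not be installed.

```r
# Option 2 sketch: use the sparse 0/1 indicators as-is (hypothetical data).
set.seed(42)
n <- 1000
dat <- data.frame(
  x1   = rnorm(n), x2 = rnorm(n),    # stand-ins for the 15 known variables
  ind1 = rbinom(n, 1, 0.05),         # sparse indicator: 1 only ~5% of the time
  ind2 = rbinom(n, 1, 0.05)
)
# Invented label rule, just so the toy data has signal:
dat$y <- factor(ifelse(dat$ind1 == 1 | dat$x1 > 0, "pos", "neg"))

# The indicators stay plain 0/1 columns; no special missing-value handling.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(y ~ ., data = dat)
  print(fit$confusion)
}
```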

3) Use separate classifiers

If around 95% of each of those 5 columns has zeroes, and they are roughly independent of each other, that's 0.95^5 = 77.38% of data (roughly 23200 rows) which has zeroes in ALL of those columns. You can train a classifier on those 23200 rows, removing the 5 columns which are all zeroes (since those columns are equal for all points, they have zero predictive utility anyway). You can then train a separate classifier for the remaining points, which will have at least one of those columns set to 1. For these points, you leave the data as it is.

Then, for your test point, if all those columns are zeroes you use the first classifier, otherwise you use the second.
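The split-and-dispatch logic above can be sketched in base R as follows (the indicator column names are hypothetical; `fit_zero` and `fit_nonzero` stand for the two trained classifiers):

```r
# Sketch of option 3: split rows by whether ALL indicators are zero.
extra_cols <- c("ind1", "ind2", "ind3", "ind4", "ind5")

split_rows <- function(dat, cols = extra_cols) {
  all_zero <- rowSums(dat[, cols, drop = FALSE]) == 0
  list(
    # all-zero rows: drop the constant indicator columns entirely
    zero_part    = dat[all_zero, setdiff(names(dat), cols), drop = FALSE],
    # remaining rows: keep the indicators as predictors
    nonzero_part = dat[!all_zero, , drop = FALSE]
  )
}

# Dispatch a test point to whichever classifier matches its indicators:
predict_split <- function(newrow, fit_zero, fit_nonzero, cols = extra_cols) {
  if (all(newrow[cols] == 0)) {
    predict(fit_zero, newrow[setdiff(names(newrow), cols)])
  } else {
    predict(fit_nonzero, newrow)
  }
}

# Quick check of the split on toy data:
toy <- data.frame(x = 1:4,
                  ind1 = c(0, 1, 0, 0), ind2 = 0, ind3 = 0,
                  ind4 = c(0, 0, 0, 1), ind5 = 0)
parts <- split_rows(toy)
nrow(parts$zero_part)     # 2 rows have every indicator at zero
nrow(parts$nonzero_part)  # 2 rows have at least one indicator set
```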

Other tips

If the 15 "normal" variables are not binary, make sure you use a classifier which can handle variables with different normalizations. If you're not sure, normalize the 15 "normal" variables to lie in the interval [0,1] -- you probably won't lose anything by doing this.
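The [0,1] rescaling is a one-liner in base R; a sketch on made-up columns:

```r
# Min-max rescaling of a numeric column to [0, 1]:
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Example on invented data:
dat <- data.frame(a = c(2, 4, 10), b = c(-1, 0, 3))
dat[] <- lapply(dat, rescale01)
dat$a  # 0.00 0.25 1.00
```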

HerrKaputt
  • Thank you HerrKaputt. The first option is okay; that's what I have now, but I want to make use of more of the available data. The second option is a bit misleading. The third option is a nice idea which I am thinking about. I want to add one more thing to my question: if there is a 1 in the column, that sample is more likely to be in the positive class (this is also known). Does this suggest any new idea? – Abhishek Dec 20 '12 at 11:02
  • Try the second and third options then. Both should involve very little extra work relative to the first (basically removing rows/columns from your data). Then let us know what values you get, and we'll work from there. – HerrKaputt Dec 20 '12 at 11:07
  • I have tried the second option already, although it's not that convincing: it did not really improve the performance. I will try the third option and let you know. By the way, did you see the edited version of my first comment? – Abhishek Dec 20 '12 at 11:09
  • I hadn't seen it, but I saw it now. It makes option 3 seem even more interesting than I thought. – HerrKaputt Dec 20 '12 at 11:28

I'd like to add a further suggestion to HerrKaputt's: if you use a probabilistic approach, you can treat "missing" as a value which you have a certain probability of observing, either globally or within each class (I'm not sure which makes more sense). If the value is missing, it occurs with probability p(missing); if it is present, it occurs with probability p(not missing) * p(val | not missing). This lets you gracefully handle the case where the observed values have an arbitrary range.
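A rough sketch of this idea for a single indicator column, estimating p(missing) within each class; the data and the naive-Bayes-style framing are invented for illustration:

```r
# Per-class probability of "missing" (0) vs "present" for one indicator
# column (toy data; in practice these feed a naive-Bayes-style score).
ind   <- c(1, 0, 0, 1, 0, 0, 0, 1)
class <- c("pos", "pos", "neg", "pos", "neg", "neg", "pos", "pos")

lik <- function(cls) {
  x <- ind[class == cls]
  p_missing <- mean(x == 0)      # p(missing | class)
  c(p_missing = p_missing,
    p_present = 1 - p_missing)   # p(not missing | class)
}
lik("pos")  # this column's contribution to the score for class "pos"
lik("neg")
```

For values with an arbitrary range, the `p_present` factor would be multiplied by a density estimate p(val | not missing), fitted only on the non-missing observations.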

Ben Allison