I am using the randomForest package in R to build a binary classifier. There are about 30,000 rows, with 14,000 in the positive class and 16,000 in the negative class. I have 15 variables that are known to be important for classification.
I also have about 5 additional variables with missing information. Each takes the value 1 or 0: 1 means that something is present, while 0 means that it is not known whether it is present or absent. It is widely known that each of these variables would be the most important variable for classification when its value is 1 (the classification becomes more reliable, and the sample is more likely to be in the positive class), but it is useless when its value is 0. For any one of these variables, only 5% of the rows have the value 1, so each variable is useful for only 5% of the cases. The 5 variables are independent of each other, so I expect that together they will be highly useful for 15-25% of the data I have.
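As a rough check on that 15-25% estimate, here is a minimal sketch of the arithmetic. It assumes exactly 5 indicator variables, each equal to 1 in 5% of rows, independently of the others. The variable names are only for illustration.

```python
# Assumption: each of the 5 indicator variables equals 1 in 5% of rows,
# independently of the others (figures taken from the description above).
n_rows = 30_000
n_indicators = 5
p_one = 0.05

# Probability that a row has at least one indicator equal to 1:
# the complement of "all 5 indicators are 0".
p_any = 1 - (1 - p_one) ** n_indicators

# Expected number of rows with at least one informative indicator.
expected_rows = n_rows * p_any

print(f"share of rows with at least one 1: {p_any:.3f}")
print(f"expected number of such rows: {expected_rows:.0f}")
```

Under these assumptions the share comes out at about 22.6%, which is within the range I expected.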
Is there a way to make use of the available data while ignoring the missing/unknown values within a single column? Any ideas and suggestions would be appreciated. The solution does not have to be specific to random forest or R; approaches using other machine learning techniques or other platforms are also welcome. Thank you for your time. Regards