I am using the `randomForest` package to create a random forest model. My data sets are huge, with more than a million observations of 200+ variables. When I train the random forest on a sample of the data, the sample does not capture every factor level of every variable.
So when I predict on the validation set using `predict()`, it throws an error because the validation data contain factor levels that were not present in the training data.
One solution is to ensure that training data variables contain all factor levels. But this is turning out to be very tedious and I don't really need all factor levels.
Is there a way to automatically exclude observations from the validation set that contain previously unseen factor levels when running `predict()` in the `randomForest` package? I couldn't find any argument for that in the CRAN documentation. I don't think I can make a reproducible example for this one.
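To make clear what I mean by excluding such observations, here is a rough base-R sketch of the manual filtering I would like to avoid writing myself. It is not a reproduction of my real problem: the toy data frames `train` and `valid` and the helper `drop_unseen_levels()` are hypothetical names made up for illustration, and I am assuming the factor columns have the same names in both data sets.

```r
# Toy data: `train` and `valid` are hypothetical stand-ins for the real sets.
train <- data.frame(
  colour = factor(c("red", "blue", "red", "blue")),
  size   = factor(c("S", "M", "M", "S")),
  y      = c(1.0, 2.0, 1.5, 2.5)
)
valid <- data.frame(
  colour = factor(c("red", "green", "blue")),  # "green" never appears in train
  size   = factor(c("S", "M", "L")),           # "L" never appears in train
  y      = c(1.2, 2.2, 1.8)
)

# Hypothetical helper: keep only the rows of `newdata` whose factor values
# all occur among the levels seen in `train`, then re-create each factor
# with the training levels so both data frames agree.
# Note: rows with NA in a factor column are dropped as well.
drop_unseen_levels <- function(train, newdata) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  factor_cols <- intersect(factor_cols, names(newdata))

  keep <- rep(TRUE, nrow(newdata))
  for (col in factor_cols) {
    keep <- keep & (as.character(newdata[[col]]) %in% levels(train[[col]]))
  }

  kept <- newdata[keep, , drop = FALSE]
  for (col in factor_cols) {
    kept[[col]] <- factor(as.character(kept[[col]]),
                          levels = levels(train[[col]]))
  }
  kept
}

valid_ok <- drop_unseen_levels(train, valid)

# With a fitted model (here called `rf_model`, also hypothetical), the
# prediction step would then be something like:
# predict(rf_model, newdata = valid_ok)
```

In this toy case only the first validation row survives, since the other two contain `"green"` and `"L"`. I am hoping there is a built-in option that does the equivalent without a loop over 200+ columns.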