
I am using the randomForest package to create a random forest model. My data sets are huge, with more than a million observations of 200+ variables. While training the random forest on a sample of the data, I am not able to capture all factor levels of all variables.

So when predicting on the validation set with predict(), it throws an error because new factor levels are present that were not captured in the training data.

One solution is to ensure that the training data variables contain all factor levels, but this is turning out to be very tedious and I don't really need all the factor levels.

Is there a way to automatically exclude observations from the validation set that contain previously unseen factor levels while running predict() in the randomForest package? I could not find any argument for that in the CRAN documentation. I don't think I can make a reproducible example for this one.
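
For illustration, a minimal sketch of the kind of exclusion I mean (the names rf_model, Train, Validation, and var1 are placeholders, and var1 is assumed to already be a factor in both data frames):

    ## Keep only validation rows whose var1 level was seen in training,
    ## then drop the now-unused levels before calling predict()
    seen  <- Validation$var1 %in% levels(Train$var1)
    valid <- droplevels(Validation[seen, , drop = FALSE])
    pred  <- predict(rf_model, newdata = valid)

In practice this would have to be repeated (or looped) over every factor column.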

Gaurav
  • But how could you predict levels which don't exist in the training data? – Sep 28 '15 at 06:19
  • I don't mind excluding observations with some levels which occur with very low frequency. I can just ignore that part of the data when predicting. – Gaurav Sep 28 '15 at 06:21

1 Answer


One solution is to combine the Train and Test matrices, apply as.factor to the combined matrix, and then split it back into Train and Test again. I had faced this same issue in random forest, and this solution worked for me.

For example:

    ## Combine the two sets so the factor captures every level
    combine <- rbind(Train, Test)
    combine$var1 <- as.factor(combine$var1)

    ## Then split back into Train and Test
    Train$var1 <- combine$var1[1:nrow(Train)]
    Test$var1  <- combine$var1[(nrow(Train) + 1):nrow(combine)]
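
A variant of the same idea that avoids rbinding the full data sets is to give both columns the union of their levels explicitly (a sketch using the same hypothetical var1 column and base R's factor(x, levels = ...)):

    ## Re-level var1 in both sets with the combined set of levels
    all_levels <- union(levels(as.factor(Train$var1)), levels(as.factor(Test$var1)))
    Train$var1 <- factor(Train$var1, levels = all_levels)
    Test$var1  <- factor(Test$var1,  levels = all_levels)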

Hope this helps!

Amrita Sawant
  • Ok, this now works, in that it gets rid of the persistent errors while training. But it still predicts some value for the observations with variables containing new factor levels (how does that make sense if those levels are not in the training set? Or does `randomForest()` have an internal method to deal with new factor levels?) – Gaurav Sep 28 '15 at 07:18
  • How many variables have this problem of new factors in validation? If too many, then I would think the training data is not a good representation of the entire data. Perhaps you may want to try stratified sampling. Excluding too many observations may cause a loss of predictive ability in your model. Post this question on Stats Exchange. – Amrita Sawant Sep 28 '15 at 16:24
  • All variables have the problem, but the low-frequency factor levels occur in less than 5% of all observations... It's still a big effort to clean all the data... but your method seems to work for a start. I suppose I should post a new question about `randomForest()` predicting for observations with new factor levels... – Gaurav Sep 29 '15 at 04:53
  • 1
    From a separate thread, same problem. http://stackoverflow.com/questions/4285214/predict-lm-with-an-unknown-factor-level-in-test-data – Amrita Sawant Sep 29 '15 at 17:20
  • In the real world, and with dynamic data, it may well be impossible to form a dataset containing all possible levels; in which case prediction becomes brittle, and Amrita's link above is the way to go - I suggest adding it to her answer as "case b)". – smci Apr 26 '16 at 23:05