5

When trying to test my trained model on new test data that has fewer factor levels than my training data, `predict()` returns the following error:

Type of predictors in new data do not match that of the training data.

My training data has a variable with 7 factor levels and my test data has that same variable with 6 factor levels (all 6 ARE in the training data).

When I add an observation containing the "missing" 7th level, the prediction runs, so I'm not sure why this happens or what the logic behind it is.

I could see randomForest choking if the test set had extra or different factor levels, but why does it fail when it's the training set that has "more" data?
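For reference, here is a minimal sketch that reproduces the situation (the data, the column name `grp`, and the seed are made up for illustration; it assumes the randomForest package is installed):

    library(randomForest)

    set.seed(1)
    # hypothetical training data: one predictor with 7 factor levels ("a".."g")
    train <- data.frame(
      grp = factor(rep(letters[1:7], length.out = 105)),
      y   = rnorm(105)
    )
    # hypothetical test data: the same predictor, but only 6 of those levels occur
    test <- data.frame(
      grp = factor(rep(letters[1:6], length.out = 18))
    )

    fit <- randomForest(y ~ grp, data = train)
    predict(fit, newdata = test)
    # Error: Type of predictors in new data do not match that of the training data.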

Sam Firke
bmcarterr
  • Because if the levels don't match exactly, they could be coded differently. Factors associate labels with integers, so "Male" could be 1 in one set and 2 in another if the factors were created differently. That means you could end up predicting something other than what you expected. R just confirms that all the levels are the same, to be safe. You don't need to add observations to make them match; you just need to adjust the `levels()` of the factor. – MrFlick Jul 21 '14 at 18:56 (see the short sketch after these comments)
  • Thanks for the answer. When I run `levels(train$data)` and `levels(test$data)`, the levels line up except that `train$data` has an extra level at the end. Does this mean I have to manually drop that level every time? – bmcarterr Jul 21 '14 at 19:46
  • All the levels must match. You don't have to drop that level, you just need to add that level to the factor in the test data. You can add levels without adding observations. You can do `test$val <- factor(test$val, levels=levels(train$val))` or something like that. You don't exactly have a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) here so it's difficult to be specific – MrFlick Jul 21 '14 at 19:50
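A quick illustration of the integer coding MrFlick describes (toy values, not from the question):

    # the same label can get a different integer code if the level sets differ
    f1 <- factor(c("Male", "Female"))                               # levels sort to: Female, Male
    f2 <- factor(c("Male", "Female"), levels = c("Male", "Female"))

    as.integer(f1)   # 2 1  -> "Male" is coded as 2
    as.integer(f2)   # 1 2  -> "Male" is coded as 1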

1 Answer

8

R expects both the training and the test data to have exactly the same factor levels (even if one of the sets has no observations for a given level or levels). In your case, since the test data is missing a level that the training data has, you can do

test$val <- factor(test$val, levels=levels(train$val))

to make sure it has all the same levels and that they are coded the same way.
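For the hypothetical `train`/`test` data sketched in the question (column name `grp` assumed there), the effect looks like this:

    levels(test$grp)
    # "a" "b" "c" "d" "e" "f"

    test$grp <- factor(test$grp, levels = levels(train$grp))
    levels(test$grp)
    # "a" "b" "c" "d" "e" "f" "g"   <- level "g" is now present, with zero observations

    predict(fit, newdata = test)   # no longer errors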

(reposted here to close out the question)

MrFlick
  • I'm a bit confused about this. https://stackoverflow.com/questions/35595499/consistent-factor-levels-for-same-value-over-different-datasets seems to show that assuring consistency of the levels is not necessary. Is it the case that some fitted models carry around the categorical encoding of the `x`s (and so the order doesn't matter) but some don't? – cd98 Nov 27 '17 at 14:39
  • @cd98 Well, `predict()` is a generic function, and it looks like `predict.lm()` must be more tolerant of differing levels. In my experience, most `predict()` functions expect identical levels, however. – MrFlick Nov 27 '17 at 15:17 (sketched below)
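For what it's worth, a small sketch of the difference discussed in these comments, reusing the hypothetical `train`/`test` data from the question. My understanding is that `predict.lm()` rebuilds the factor in `newdata` from the `xlevels` stored in the fitted model, so a subset of the training levels is tolerated, while randomForest's predict method requires identical level sets:

    fit_lm <- lm(y ~ grp, data = train)

    # works whether or not levels(test$grp) was expanded to match the training levels,
    # because predict.lm() re-levels newdata from fit_lm$xlevels
    predict(fit_lm, newdata = test)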