Error when using predict() on a randomForest object trained with caret's train() using formula

Question

Using R 3.2.0 with caret 6.0-41 and randomForest 4.6-10 on a 64-bit Linux machine.

When trying to use the predict() method on a randomForest object trained with the train() function from the caret package using a formula, the function returns an error. When training via randomForest() and/or using x= and y= rather than a formula, it all runs smoothly.

Here is a working example:

library(randomForest)
library(caret)

data(imports85)
imp85     <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85     <- imp85[complete.cases(imp85), ]
imp85[]   <- lapply(imp85, function(x) if (is.factor(x)) x[,drop=TRUE] else x) ## Drop empty levels for factors.

modRf1  <- randomForest(numOfDoors~., data=imp85)
caretRf <- train( numOfDoors~., data=imp85, method = "rf" )
modRf2  <- caretRf$finalModel
modRf3  <- randomForest(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"])
caretRf <- train(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"], method = "rf")
modRf4  <- caretRf$finalModel

p1      <- predict(modRf1, newdata=imp85)
p2      <- predict(modRf2, newdata=imp85)
p3      <- predict(modRf3, newdata=imp85)
p4      <- predict(modRf4, newdata=imp85)

Among the last 4 lines, only the second one p2 <- predict(modRf2, newdata=imp85) returns the following error:

Error in predict.randomForest(modRf2, newdata = imp85) : 
variables in the training data missing in newdata

It seems that the reason for this error is that the predict.randomForest method uses rownames(object$importance) to determine the names of the variables used to train the random forest object. And when looking at

rownames(modRf1$importance)
rownames(modRf2$importance)
rownames(modRf3$importance)
rownames(modRf4$importance)

We see:

[1] "stroke"   "price"    "fuelType"
[1] "stroke"   "price"    "fuelTypegas"
[1] "stroke"   "price"    "fuelType"
[1] "stroke"   "price"    "fuelType"

So it seems that using the caret train() function with a formula changes the names of the (factor) variables in the importance field of the randomForest object.
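The mismatch can be checked directly by comparing the names the forest expects with the columns of the new data. This is a minimal sketch that types in the names printed above by hand, so it runs without either package; `expected`, `available` and `missing_vars` are illustrative names, not part of caret or randomForest.

```r
# Predictor names stored in the caret-trained forest (copied from the output above).
expected  <- c("stroke", "price", "fuelTypegas")
# Column names of the data frame passed as newdata.
available <- c("stroke", "price", "fuelType", "numOfDoors")

# Names the forest expects but newdata does not provide.
missing_vars <- setdiff(expected, available)
print(missing_vars)
```

With the real objects, the same comparison would be setdiff(rownames(modRf2$importance), names(imp85)).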

Is this really an inconsistency between the formula and non-formula versions of the caret train() function? Or am I missing something?

`modRf3 <- randomForest(x=dataTrain[,c("stroke", "price", "fuelType")], y=dataTrain[, "numOfDoors"], data=imp85)` returns `Error in randomForest(x = dataTrain[, c("stroke", "price", "fuelType")], : object 'dataTrain' not found` — , May 07 '15 at 10:02
As pointed out, you did not define `dataTrain` in your example which means the problem is not [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It's not easy to help you if we can't run the code and get the same results as you. — MrFlick, May 07 '15 at 12:52
My bad, `dataTrain` should have been `imp85`, I edited the code in the original question. I also removed the option `data=imp85` in the call where `x` and `y` are explicitly mentioned as there is no use for it. — Adrien Combaz, May 07 '15 at 14:00

score 36 · Accepted Answer · edited Apr 08 '22 at 07:41

First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.

There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.

So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat), but train (and most others) will when called like train(y ~ ., data = dat).

The error occurs because fuelType is a factor. With the formula method, train replaces it with a dummy variable named fuelTypegas, so predict.randomForest looks for a column of that name in newdata and can't find it.
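The renaming can be reproduced with base R alone. The sketch below uses a made-up data frame (`toy` is not part of the question's data) and model.matrix, which is the standard formula expansion in R; caret's formula interface appears to do something equivalent and drop the intercept column, but treat that detail as an assumption.

```r
# Toy data with one numeric and one factor predictor (made-up values).
toy <- data.frame(
  stroke   = c(2.7, 3.4, 3.1, 3.5),
  fuelType = factor(c("gas", "diesel", "gas", "diesel"))
)

# The formula machinery expands the factor into indicator (dummy) columns.
mm <- model.matrix(~ ., data = toy)

# The factor column is gone; a column named after factor + level replaces it.
print(colnames(mm))
```

The column names contain fuelTypegas and no longer contain fuelType, which matches the importance row names seen in the question.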

Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.

TL;DR

Use the non-formula method with train if you want the same levels, or use predict.train.
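Both options can be sketched with the question's data. The object names `fitFormula`, `fitXY` and `predictors` are illustrative, the seed is arbitrary, and droplevels() stands in for the question's lapply() call.

```r
library(randomForest)  # also provides the imports85 data set
library(caret)

data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- droplevels(imp85[complete.cases(imp85), ])

set.seed(1)  # arbitrary seed, only for repeatability

# Option 1: keep the formula interface and predict through the train object,
# so caret applies the same dummy-variable encoding to newdata.
fitFormula <- train(numOfDoors ~ ., data = imp85, method = "rf")
pFormula   <- predict(fitFormula, newdata = imp85)

# Option 2: use the x/y interface so the factor reaches randomForest unchanged;
# predicting from $finalModel then works, as p4 does in the question.
predictors <- c("stroke", "price", "fuelType")
fitXY <- train(x = imp85[, predictors], y = imp85$numOfDoors, method = "rf")
pXY   <- predict(fitXY$finalModel, newdata = imp85)
```

Option 1 is the safer default, since predict on the train object also reapplies any preprocessing that train performed.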

I unfortunately don't have enough reputation to upvote your answer, but you answered my question perfectly. I had been wondering whether, for all those functions that accept a formula, there was a difference in the way the data was treated between the formula and non-formula versions of the function call. Now I know! As for the use of `$finalModel`, I agree that it's generally not a good idea. Here I just wanted to compare the outcome of the `caret` and `randomForest` methods. — Adrien Combaz, May 11 '15 at 18:56

score 0 · Answer 2 · answered Jan 05 '17 at 07:46

There can be two reasons why you get this error.

1. The categories of the categorical variables in the train and test sets don't match. To check that, you can run something like the following.

Well, first of all, it is good practice to keep the independent variables/features in a list. Say that list is "vars". And say, you separated "Data" into "Train" and "Test". Let's go:

for (v in vars) {
  if (is.factor(Data[[v]])) {
    print(v)
    # print(levels(Train[[v]]))
    # print(levels(Test[[v]]))
    # TRUE only when both sets have the same levels in the same order
    print(identical(levels(Train[[v]]), levels(Test[[v]])))
  }
}

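Once you find the non-matching categorical variables, you can go back and make the two sets share the same categories, and then re-build your model if the training data changed. In a loop similar to the one above, for each nonMatchingVar, you can do

# Re-encode Test against the Train levels; Test values not seen in Train become NA.
Test$nonMatchingVar <- factor(Test$nonMatchingVar, levels = levels(Train$nonMatchingVar))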

2. A silly one. If you accidentally leave the dependent variable in the set of independent variables, you may run into this error message. I have made that mistake myself. Solution: just be more careful.
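For the first cause, it matters how the levels are aligned. The sketch below uses two made-up factors (`train_x` and `test_x` are illustrative names) to contrast re-encoding with factor(), which keeps the values, against assigning to levels(), which relabels by position.

```r
# Made-up train/test factors whose level sets differ.
train_x <- factor(c("gas", "diesel", "gas"))
test_x  <- factor(c("gas", "gas"))

# Re-encode the test values against the training levels; values are unchanged.
aligned <- factor(test_x, levels = levels(train_x))
print(levels(aligned))
print(as.character(aligned))

# By contrast, assigning levels() relabels by position and changes the data:
# the single test level "gas" is renamed to the first training level.
relabeled <- test_x
levels(relabeled) <- levels(train_x)
print(as.character(relabeled))
```

Here `aligned` still holds "gas" values, while `relabeled` silently turns them into "diesel", so the factor() form is usually the one to use.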

score 0 · Answer 3 · edited Feb 27 '18 at 22:20

Another way is to explicitly code the testing data using model.matrix, e.g.

p2 <- predict(modRf2, newdata=model.matrix(~., imp85))

answered Feb 27 '18 at 21:41 by WeimusT · edited by t j
