
To illustrate the differences between $finalModel$predicted and the values computed by predict(), I set up the following code:

library(caret)
library(randomForest)

dat <- data.frame(target = c(2.5, 4.5, 6.1, 3.2, 2.2),
              A = c(1.3, 4.4, 5.5, 6.7, 8.1),
              B = c(44.5, 50.1, 23.7, 89.2, 10.5),
              C = c("A", "A", "B", "B", "B"))

control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        search = "grid", savePredictions = TRUE)

tunegrid <- expand.grid(.mtry=c(1:3))

set.seed(42)
rf_gridsearch <- train(target ~ A + B + C, 
                   data = dat, 
                   method="rf",
                   ntree = 2500, 
                   metric= "RMSE", 
                   tuneGrid=tunegrid, 
                   trControl=control)

dat$pred_caret <- rf_gridsearch$finalModel$predicted

dat$pred <- predict(object = rf_gridsearch, newdata = dat[,2:4])
dat$pred2 <- predict(object = rf_gridsearch$finalModel, newdata = dat[,2:4])

The last line of this code gives the error message:

Error in predict.randomForest(object = rf_gridsearch$finalModel, 
newdata = dat[,  : variables in the training data missing in newdata

How is it possible to use $finalModel with predict?

Why does the data in column dat$pred_caret differ from dat$pred? What is the difference between the 2 predictions?

Julius Vainora
yPennylane

1 Answer


There are already many questions related to this issue, both on SO and on Stats.SE (Question 1, Question 2, Question 3, Question 4, Question 5).


As a couple of answers on Stats.SE mention, dat$pred_caret differs from dat$pred because predict.train predicts every row of the training set with the full forest, while the documentation of predict.randomForest says of its newdata argument:

newdata - a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)

where rf_gridsearch$finalModel$predicted is basically the same as

randomForest:::predict.randomForest(rf_gridsearch$finalModel)

since rf_gridsearch$finalModel is an object of randomForest class. That is, no newdata gets provided.
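
The distinction can be checked directly on a randomForest object. The following is only a minimal sketch: it uses the built-in iris data and an arbitrary formula as stand-ins (neither comes from the question), and the object names rf, oob and in_sample are illustrative.

```r
library(randomForest)

set.seed(42)
# Stand-in regression forest on built-in data (not the data from the question)
rf <- randomForest(Sepal.Length ~ Sepal.Width + Petal.Length + Species,
                   data = iris, ntree = 500)

# No newdata: the stored out-of-bag (OOB) predictions are returned
oob <- predict(rf)

# newdata given: every tree in the forest votes on every row
in_sample <- predict(rf, newdata = iris)

# The OOB predictions are exactly what is stored in $predicted
isTRUE(all.equal(oob, rf$predicted))

# The two kinds of prediction generally differ on the training rows
head(cbind(oob = oob, in_sample = in_sample))
```

Because each OOB prediction only averages the trees that did not see that row during fitting, it is usually further from the observed value than the in-sample prediction.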

As for the error, it arises because train and randomForest treat the data differently. This time it's not about scaling or centering, but about creating dummy variables. With the formula interface, train expanded the factor C into the dummy variable CB <- 1 * (C == "B") before fitting, so the final model was trained on the columns A, B and CB. predict.randomForest therefore looks for CB in newdata, while dat[, 2:4] only contains C. Hence, you may replicate the result of predict.train with

predict(object = rf_gridsearch$finalModel, 
        newdata = model.matrix(~ A + B + C, dat[, 2:4])[, -1])

where

model.matrix(~ A + B + C, dat[, 2:4])[, -1]
#     A    B CB
# 1 1.3 44.5  0
# 2 4.4 50.1  0
# 3 5.5 23.7  1
# 4 6.7 89.2  1
# 5 8.1 10.5  1
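
To see the same mechanism end to end, here is a hedged sketch on the built-in iris data (a stand-in chosen so the example runs without warnings about tiny samples; the 5-fold CV setup and the names fit, x_new, p_train and p_final are assumptions for illustration). It inspects the column names the final model was actually trained on, reproduces the error, and then compares the two prediction routes.

```r
library(caret)
library(randomForest)

set.seed(42)
# Species is a factor, so the formula interface of train() expands it into dummies
fit <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Species,
             data = iris, method = "rf", ntree = 500,
             tuneGrid = expand.grid(mtry = 1:3),
             trControl = trainControl(method = "cv", number = 5))

# Predictor names seen by the underlying randomForest object
rownames(fit$finalModel$importance)

# Passing the raw data frame fails, because the dummy columns are absent
err <- tryCatch(predict(fit$finalModel, newdata = iris),
                error = function(e) conditionMessage(e))
err

# Building the dummies by hand gives the final model the columns it expects
x_new <- model.matrix(~ Sepal.Width + Petal.Length + Species, iris)[, -1]
p_final <- predict(fit$finalModel, newdata = x_new)
p_train <- predict(fit, newdata = iris)

isTRUE(all.equal(unname(p_train), unname(p_final)))
```

In practice it is usually simpler to call predict() on the train object itself and let caret rebuild the design matrix.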
Julius Vainora
  • Thanks for your response. But I provided new data for the predict() function. So why does it return the OOB prediction? – yPennylane Jan 08 '19 at 16:22
  • @yPennylane, I clarified that in my answer, let me know if something is still unclear. You did indeed provide data in `dat$pred2`, but got an error. I explained the error; once we fix it, then it indeed isn't the OOB prediction anymore and coincides with `dat$pred`. – Julius Vainora Jan 08 '19 at 16:40
  • I also provided new data for computing `dat$pred` (`dat$pred <- predict(object = rf_gridsearch, newdata = dat[,2:4])`). That line didn't produce an error, but gave different values than `rf_gridsearch$finalModel$predicted` – yPennylane Jan 08 '19 at 16:56
  • @yPennylane, 1) `rf_gridsearch` is of class `train` and uses `predict.train`, 2) `rf_gridsearch$finalModel` is of class `randomForest` and uses `predict.randomForest`, 3) due to what I said in my answer (and many linked answers discuss), `rf_gridsearch$finalModel$predicted` gives the same as `predict(rf_gridsearch$finalModel)` and that is OOB, 4) using `predict(rf_gridsearch)` with or without newdata gives not OOB, `predict(rf_gridsearch$finalModel, newdata = ...)` also would give not OOB. – Julius Vainora Jan 08 '19 at 17:01
  • ok. But what is it then, that `predict(rf_gridsearch)` gives? Does it predict values using all the input data (newdata) with the final model? – yPennylane Jan 08 '19 at 17:14
  • @yPennylane, yes, whenever we use `newdata` with either of the two objects, we specify the test data to predict, and the difference only appears in how `predict.train` and `predict.randomForest` behave without `newdata`. The former, `predict(rf_gridsearch)`, indeed simply uses all the input data, i.e. the training set, as `newdata`. – Julius Vainora Jan 08 '19 at 17:21