
I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.

I’ll also add that, given that randomForest uses a bootstrap sample to fit each of its ntree trees and estimates error from the out-of-bag predictions, I’m a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl() call. These options generate different results, but neither matches the randomForest() model.

I’ve read the caret package website (http://topepo.github.io/caret/index.html) as well as various StackOverflow questions that seemed relevant, but I haven’t been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.

Here’s a reproducible example using the CO2 dataset (it ships with base R’s datasets package, so the library(MASS) call below isn’t strictly needed to load it).

library(MASS)  # not strictly needed; CO2 is in base R's datasets package
data(CO2)

library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE,
                         metric = "RMSE")  # randomForest() silently ignores metric; it is a caret argument

library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
                         data = CO2,
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE,
                         metric = "RMSE",
                         trControl = trainControl(method = "oob",
                                                  allowParallel = FALSE))  # allowParallel is a trainControl option

set.seed(1)
caret.boot.model <- train(uptake ~ .,
                          data = CO2,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", number = 50,
                                                   allowParallel = FALSE))

print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)

This produces the following output:

print(rf.model)

      Mean of squared residuals: 9.380421
                % Var explained: 91.88

print(caret.oob.model$finalModel)

      Mean of squared residuals: 38.3598
                % Var explained: 66.81

print(caret.boot.model$finalModel)

      Mean of squared residuals: 42.56646
                % Var explained: 63.16

And the code to look at variable importance:

importance(rf.model)

importance(caret.oob.model$finalModel)

importance(caret.boot.model$finalModel)
ej5607
  • Short of using your actual data and trying to reproduce the models you built, have you examined the behavior of the caret and randomForest models? If both exhibit very similar important predictors, with similar weights, then you might be less concerned about other variations. – Tim Biegeleisen Apr 18 '16 at 14:39
  • Hi Tim - Thank you for your time and input. I have looked at the variable importance (I updated the code above to reflect this), and I'm getting different weights for the predictors. Even if the weights weren't that different, though, I'd still want to understand what was causing the differences. When I can't explain something, I'm always concerned about what I don't know that I don't know! – ej5607 Apr 18 '16 at 15:09
  • Using the formula interface in train converts factors to dummy variables. To compare with randomForest you should use the non-formula interface, for example `train(CO2[, -5], CO2$uptake, method="rf", ...)`. – Lluís Ramon Apr 18 '16 at 15:25
  • Hi Lluís - Thank you for that suggestion! The results from the caret model are now much closer to the original randomForest model. They still aren't exact, though. Is there anything else that might be going on here? – ej5607 Apr 18 '16 at 20:23

1 Answer


Using the formula interface in train converts factors to dummy variables. To compare results from caret with randomForest, you should use the non-formula interface.
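As a quick illustration, base R's model.matrix() performs the same factor-to-dummy expansion, so you can count the predictors each interface actually sees:

# The formula interface expands the factors Plant, Type and Treatment into
# indicator/contrast columns, so mtry = 2 samples from a much larger
# candidate set than the four original predictors.
X <- model.matrix(uptake ~ ., data = CO2)[, -1]  # drop the intercept column
ncol(X)        # number of predictors after dummy expansion
ncol(CO2) - 1  # the 4 original predictors: Plant, Type, Treatment, conc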

In your case, you should also provide a seed inside trainControl to get the same result as randomForest().

In the model training section of the caret website there are some notes on reproducibility that explain how to use seeds.

library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ ., 
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE, 
                         metric = "RMSE")

library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake, 
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE, 
                         metric = "RMSE",
                         trControl = trainControl(method = "oob",
                                                  seeds = 1,  # the argument is seeds; here it seeds the final model fit
                                                  allowParallel = FALSE))
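To check that the two fits now agree, you can compare them directly (a quick sanity check, assuming both models above have been run):

# With matching seeds and the non-formula interface, the fitted forests
# should be identical, so the OOB error trace and the importance matrix agree.
all.equal(rf.model$mse, caret.oob.model$finalModel$mse)
all.equal(importance(rf.model), importance(caret.oob.model$finalModel))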

If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.

In the following example, method = "boot" uses the trainControl default of 25 resamples, so seeds needs 26 elements; the last element is for the final model and I set it to 1.

seeds <- as.vector(c(1:26), mode = "list")  # 25 resampling seeds + 1 for the final model

# For the final model
seeds[[26]] <- 1

caret.boot.model <- train(CO2[, -5], CO2$uptake, 
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE, 
                          metric = "RMSE",
                          trControl = trainControl(method = "boot",
                                                   seeds = seeds,
                                                   allowParallel = FALSE))
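More generally, when the tuning grid has several candidate values, each resampling element of seeds must hold one integer per model fit. A sketch of how that might look, assuming the default 25 bootstrap resamples and a hypothetical three-value mtry grid:

B <- 25  # resamples (the trainControl default for method = "boot")
M <- 3   # models fit per resample, e.g. tuneGrid = data.frame(mtry = 1:3)

seeds <- vector(mode = "list", length = B + 1)
for (i in seq_len(B)) seeds[[i]] <- sample.int(10000, M)  # M seeds per resample
seeds[[B + 1]] <- 1  # a single integer for the final model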

Defining the non-formula interface correctly and setting the seeds in trainControl, you will get the same results from all three models:

rf.model
caret.oob.model$finalModel
caret.boot.model$finalModel
Lluís Ramon
  • The mean of squared residuals and variance explained by the model now match exactly between the randomForest and caret models. The %IncMSE predictor importance estimates remain different, though, even with the non-formula interface. What might explain this? – ej5607 Apr 19 '16 at 14:59
  • I get the same results in `importance(rf.model)`, `importance(caret.oob.model$finalModel)` and `importance(caret.boot.model$finalModel)`. – Lluís Ramon Apr 19 '16 at 15:05
  • To compare them correctly you should use the `importance` function on both `rf.model` and `caret.oob.model$finalModel`, as I did. Are you doing this? – Lluís Ramon Apr 19 '16 at 15:35
  • Hi Lluís - Ah, that's it! Yes, you're right. I wasn't doing that for the caret models. I've changed my code and now everything matches. Thank you so very much, Lluís! – ej5607 Apr 19 '16 at 15:40
  • How can we set a `seed` for a model built directly with the `randomForest` package? I don't see any specific argument for this, nor one in `trainControl`. – joel.wilson Sep 01 '17 at 03:34