
I am using an example based on this question:

# restore the column names that begin with a digit
train$`1stFlrSF` <- train$S1stFlrSF 
train$`2ndFlrSF` <- train$S2ndFlrSF
train$`3SsnPorch` <- train$S3SsnPorch

library("randomForest")
set.seed(1)
rf.model <- randomForest(SalePrice ~ ., 
                         data = train,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE, 
                         metric = "RMSE")

library("caret")
caret.oob.model <- train(train[,-ncol(train)], train$SalePrice, 
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE, 
                         metric = "RMSE",
                         trControl = trainControl(method = "oob", seed = 1),
                         allowParallel = FALSE) 

But the caret.oob.model call throws an error:

Error: Bad seeds: the seed object should be a list of length 2 with 1 integer vectors of size 1 and the last list element having at least a single integer.

Here is my dataset: https://drive.google.com/file/d/1el-gAgA93EbYnM6VnDqzhT5c5uWsnKvq/view?usp=sharing

How can I solve this problem?

asked by Ekaterina
  • what is `train[,-ncol(train)]` supposed to do? – dvd280 Aug 24 '20 at 07:59
  • @dvd280 all variables with the exception of SalePrice – Ekaterina Aug 24 '20 at 08:13
  • Make sure to read the `?trainControl` help page. There is a discussion about the seeds= parameter. – MrFlick Aug 24 '20 at 08:22
  • @Ekaterina is `SalesPrice ` the last column in the dataframe? – dvd280 Aug 24 '20 at 08:24
  • The help page of `trainControl` says: "an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel." Why don't you delete the `, seed = 1` part and use `set.seed(1)` before running the code? – UseR10085 Aug 24 '20 at 08:48
  • @Bappa Das thanks! Why are the caret and randomForest results different? – Ekaterina Aug 24 '20 at 08:53

1 Answer


randomForest is a stochastic algorithm that depends on random sampling of rows and columns. Setting the RNG seed makes the results reproducible. For randomForest, a single set.seed() call before the training function is sufficient. In caret things are more complicated because of resampling and the fact that more than one model is fitted.

In your case, even without resampling, you are fitting two models: one for the OOB evaluation of the mtry hyperparameter and one final model.

The help page for ?trainControl states that the seeds argument is an optional set of integers that will be used to set the seed at each resampling iteration.

It is specified as a list of B+1 elements, where B is the number of resamples (the "boot632" method is an exception). The first B elements of the list should be integer vectors of length M, where M is the number of models being evaluated (in your case 1). The last element of the list needs only to be a single integer, used for the final model.
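To make the B+1 structure concrete, here is a sketch of building a valid seeds list. The values B = 5 (5-fold CV) and M = 3 (three tuning candidates) are hypothetical; adjust them to match your own trainControl and tuneGrid settings.

```r
library(caret)

# Hypothetical setup: 5-fold CV (B = 5 resamples) tuning M = 3 candidate models.
B <- 5
M <- 3

set.seed(1)
seeds <- vector(mode = "list", length = B + 1)
# one integer per model for each resampling iteration
for (i in seq_len(B)) seeds[[i]] <- sample.int(10000, M)
# a single integer for the final model fit
seeds[[B + 1]] <- sample.int(10000, 1)

ctrl <- trainControl(method = "cv", number = B, seeds = seeds)
```

For method = "oob" there is effectively one "resample" (the OOB evaluation), which is why a list of length 2 is required, as in the example below.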

Example:

library(randomForest)
library(caret)

data(mtcars)

set.seed(1)
rf.model <- randomForest(mpg ~ ., 
                         data = mtcars,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE, 
                         metric = "RMSE")

rf.model

Call:
 randomForest(formula = mpg ~ ., data = mtcars, ntree = 50, nodesize = 5,      mtry = 2, importance = TRUE, metric = "RMSE") 
               Type of random forest: regression
                     Number of trees: 50
No. of variables tried at each split: 2

          Mean of squared residuals: 7.353122
                    % Var explained: 79.1

caret.oob.model <- train(mpg ~ ., 
                         data = mtcars, 
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE, 
                         metric = "RMSE",
                         trControl = trainControl(method = "oob", seeds = list(1, 1))) 

caret.oob.model$finalModel

Call:
 randomForest(x = x, y = y, ntree = 50, mtry = param$mtry, nodesize = 5,      importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 50
No. of variables tried at each split: 2

          Mean of squared residuals: 7.353122
                    % Var explained: 79.1

It looks to me like the models are the same, based on the identical mean of squared residuals and % Var explained.
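As suggested in the comment thread, an alternative is to drop the seeds argument and set the global RNG seed right before calling train(). This is a sketch: since method = "oob" involves no resampling, the fit should be reproducible this way, though it is not guaranteed to match the seeds-based call exactly.

```r
library(randomForest)
library(caret)

data(mtcars)

# Seed the RNG globally instead of passing seeds to trainControl.
set.seed(1)
caret.globalseed.model <- train(mpg ~ .,
                                data = mtcars,
                                method = "rf",
                                ntree = 50,
                                tuneGrid = data.frame(mtry = 2),
                                nodesize = 5,
                                importance = TRUE,
                                metric = "RMSE",
                                trControl = trainControl(method = "oob"))

caret.globalseed.model$finalModel
```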

answered by missuse