I tried recreating a random forest model using caret and I appear to get slightly different results.

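## Set up data
library(psych)   # provides the sat.act data set
library(caret)   # createDataPartition(), trainControl(), train()
library(dplyr)   # %>% and mutate()
data(sat.act)
sat.act <- na.omit(sat.act)
# Recode the outcome and convert it to a factor
sat.act <- sat.act %>% mutate(gender = ifelse(gender == 1, "male", "female"))
sat.act$gender <- as.factor(sat.act$gender)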

#create train and test
set.seed(123)
indexes<-createDataPartition(y=sat.act$gender,p=0.7,list=FALSE)
train<-sat.act[indexes,]
test<-sat.act[-indexes,]

Create a model using 5-fold cross-validation to find the best mtry:

set.seed(123)
ctrl <- trainControl(method = "cv",
                      number = 5,
                      savePredictions = TRUE,
                      summaryFunction = twoClassSummary,
                      classProbs = TRUE)

model <- train(gender ~ ., data=train, 
                  trControl = ctrl, 
                  method= "rf", 
                  preProc=c("center","scale"), 
                  metric="ROC",
                  importance=TRUE)


> model$finalModel

#Call:
# randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
#               Type of random forest: classification
#                     Number of trees: 500
#No. of variables tried at each split: 2

#        OOB estimate of  error rate: 39%
#Confusion matrix:
#       female male class.error
#female    238   72   0.2322581
#male      116   56   0.6744186

Cross-validation showed that the best mtry is 2. Now fit another model without resampling, passing mtry = 2 directly, and look at the results:

set.seed(123)
ctrl_other <- trainControl(method="none", savePredictions = TRUE, summaryFunction=twoClassSummary, classProbs=TRUE)

model_other <- train(gender ~., data=train, trControl=ctrl_other, importance=TRUE, tuneGrid = data.frame(mtry = 2))



> model_other$finalModel

#Call:
# randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
#               Type of random forest: classification
#                     Number of trees: 500
#No. of variables tried at each split: 2
#
#        OOB estimate of  error rate: 37.34%
#Confusion matrix:
#       female male class.error
#female    245   65   0.2096774
#male      115   57   0.6686047

So these appear to be two identical models (both with mtry = 2 and ntree = 500), yet the final models give different results: the OOB error rate is 39% in the first and 37.34% in the second. Why?

desertnaut
PleaseHelp
  • What about `preProc=c("center","scale")` in the 2nd attempt? – desertnaut Jun 11 '20 at 15:43
  • Yup, I added that and still get dissimilar results. I checked the linked question, and while it compares caret to the randomForest package, it doesn't quite explain the difference between cv and no cv as in this question – PleaseHelp Jun 11 '20 at 16:28
  • I reopened it; please update your post to include the results with `preProc=c("center","scale")` – desertnaut Jun 11 '20 at 16:29
  • The models differ because the algorithm is stochastic and the seed is not set at the same point prior to building the final model in the two cases. In the second case the seed is set right before the final model is built, while in the first case it is set before tuning (quite a few models are fit before the final one). – missuse Jun 11 '20 at 17:51
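
The point in the last comment can be illustrated without caret at all. The sketch below is only an analogy in base R: `fit_final()` is a hypothetical stand-in for the final forest fit (a single bootstrap draw), and the `replicate()` call is an assumed stand-in for the resampling fits that run during tuning. It is not what caret does internally, just a picture of how the position in the random number stream changes.

```r
# Hypothetical stand-in for "fit the final forest": one bootstrap draw,
# the kind of random step a random forest performs internally.
fit_final <- function(n = 20) sample.int(n, replace = TRUE)

# Case 1: seed set before tuning. The resampling fits consume random
# numbers first, so the final fit starts further along the RNG stream.
set.seed(123)
tuning <- replicate(5, sample.int(20, replace = TRUE))  # pretend CV fits
final_after_tuning <- fit_final()

# Case 2: seed set immediately before the final fit (like method = "none").
set.seed(123)
final_direct <- fit_final()

# Case 3: same as case 1, but the seed is reset right before the final fit.
set.seed(123)
tuning <- replicate(5, sample.int(20, replace = TRUE))
set.seed(123)
final_reseeded <- fit_final()

identical(final_after_tuning, final_direct)  # a match here would be pure coincidence
identical(final_reseeded, final_direct)      # TRUE: same RNG state, same draw
```

Cases 1 and 2 mirror the two `train()` calls in the question: the same seed value, but a different amount of randomness consumed before the final fit. `trainControl()` also has a `seeds` argument for controlling the seeds used during resampling and for the final model; whether using it makes the two fits in the question match exactly has not been checked here.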
