I tried recreating a random forest model using caret
and I appear to get slightly different results.
##Set up data
library(caret)
library(dplyr)
library(psych)   # sat.act data
data(sat.act)
sat.act <- na.omit(sat.act)
#recode outcome and make it a factor
sat.act <- sat.act %>% mutate(gender = ifelse(gender == 1, "male", "female"))
sat.act$gender <- as.factor(sat.act$gender)
#create train and test
set.seed(123)
indexes<-createDataPartition(y=sat.act$gender,p=0.7,list=FALSE)
train<-sat.act[indexes,]
test<-sat.act[-indexes,]
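As a sanity check that the split itself is not the source of any discrepancy, here is a base-R sketch (no caret; the row count is hypothetical, just for illustration) showing that a fixed seed makes the index draw fully reproducible:

```r
# Base-R sketch: with the same seed, repeating the partition step
# produces the identical set of row indices.
n <- 482                                  # hypothetical row count
set.seed(123)
idx1 <- sample(n, size = round(0.7 * n))
set.seed(123)
idx2 <- sample(n, size = round(0.7 * n))
identical(idx1, idx2)                     # TRUE
```

So as long as set.seed(123) is called right before createDataPartition, both runs should start from the same train/test split.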
Create a model using 5-fold CV to find the best mtry:
set.seed(123)
ctrl <- trainControl(method = "cv",
number = 5,
savePredictions = TRUE,
summaryFunction = twoClassSummary,
classProbs = TRUE)
model <- train(gender ~ ., data=train,
trControl = ctrl,
method= "rf",
preProc=c("center","scale"),
metric="ROC",
importance=TRUE)
> model$finalModel
#Call:
# randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 2
#
# OOB estimate of error rate: 39%
#Confusion matrix:
# female male class.error
#female 238 72 0.2322581
#male 116 56 0.6744186
Cross-validation showed the best mtry is 2, so build a second model with mtry = 2 fixed and compare the results.
set.seed(123)
ctrl_other <- trainControl(method = "none",
savePredictions = TRUE,
summaryFunction = twoClassSummary,
classProbs = TRUE)
model_other <- train(gender ~ ., data = train,
trControl = ctrl_other,
method = "rf",   # explicit, though "rf" is caret's default
preProc = c("center", "scale"),   # same preprocessing as before, so only trainControl differs
importance = TRUE,
tuneGrid = data.frame(mtry = 2))
> model_other$finalModel
#Call:
# randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 2
#
# OOB estimate of error rate: 37.34%
#Confusion matrix:
# female male class.error
#female 245 65 0.2096774
#male 115 57 0.6686047
So these appear to be two identical models (both with mtry = 2 and ntree = 500), yet the two final models give different results. Why?
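One hypothesis I wanted to illustrate (a base-R sketch, not caret-specific): the 5-fold CV loop draws random numbers for resampling before the final model is fit, so even with the same seed the final randomForest call starts from a different point in the RNG stream than the method = "none" fit does. Consuming the stream changes every later draw (the 25 below is an arbitrary stand-in for whatever CV consumes):

```r
set.seed(123)
first_draw <- runif(1)   # fit immediately after the seed (like method = "none")

set.seed(123)
invisible(runif(25))     # stand-in for random draws consumed by CV resampling
later_draw <- runif(1)   # fit after the CV loop has used part of the stream

first_draw == later_draw # FALSE: same seed, different RNG state at fit time
```

If that is the cause, the two fits are the same algorithm with the same hyperparameters but trained on different bootstrap samples, which would explain the differing OOB errors and confusion matrices.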