I used the Sonar example from the caret page, a two-class sonar classification problem. The Sonar Class column is a factor with levels ordered as M and R. I changed the order of the factor levels to R and M and noticed that the predictions changed too. Here is my code:
library(mlbench)
library(caret)
data(Sonar)
set.seed(998)
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30)*50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
### original data set with Sonar$Class levels : c('M','R')
levels(Sonar$Class)
inTraining_MR <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training_MR <- Sonar[ inTraining_MR,]
testing_MR <- Sonar[-inTraining_MR,]
set.seed(825)
gbmFit_MR <- train(Class ~ ., data = training_MR,
                   method = "gbm",
                   trControl = fitControl,
                   verbose = FALSE,
                   tuneGrid = gbmGrid,
                   ## Specify which metric to optimize
                   metric = "ROC")
gbmFit_MR
pred_MR = predict(gbmFit_MR, newdata = head(testing_MR))
prob_MR = predict(gbmFit_MR, newdata = head(testing_MR), type = "prob")
res_MR = data.frame(observed = head(Sonar$Class),
                    predicted = pred_MR,
                    probM = prob_MR$M,
                    probR = prob_MR$R)
res_MR
### modified data set with Sonar$Class levels : c('R','M')
Sonar$Class = factor(Sonar$Class, levels=c('R','M'))
levels(Sonar$Class)
set.seed(998)
inTraining_RM <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training_RM <- Sonar[ inTraining_RM,]
testing_RM <- Sonar[-inTraining_RM,]
set.seed(825)
gbmFit_RM <- train(Class ~ ., data = training_RM,
                   method = "gbm",
                   trControl = fitControl,
                   verbose = FALSE,
                   tuneGrid = gbmGrid,
                   ## Specify which metric to optimize
                   metric = "ROC")
gbmFit_RM
pred_RM = predict(gbmFit_RM, newdata = head(testing_RM))
prob_RM = predict(gbmFit_RM, newdata = head(testing_RM), type = "prob")
res_RM = data.frame(observed = head(Sonar$Class),
                    predicted = pred_RM,
                    probM = prob_RM$M,
                    probR = prob_RM$R)
res_RM
The prediction results:
> levels(Sonar$Class)
[1] "M" "R"
> res_MR
DataFrame with 6 rows and 4 columns
  observed predicted        probM        probR
  <factor>  <factor>    <numeric>    <numeric>
1        R         R 9.799645e-04 0.9990200355
2        R         R 1.825908e-04 0.9998174092
3        R         R 5.373401e-08 0.9999999463
4        R         R 1.693365e-03 0.9983066351
5        R         M 9.999348e-01 0.0000651877
6        R         M 9.862454e-01 0.0137546480
> levels(Sonar$Class)
[1] "R" "M"
> res_RM
DataFrame with 6 rows and 4 columns
  observed predicted       probM      probR
  <factor>  <factor>   <numeric>  <numeric>
1        R         R 0.091199794 0.90880021
2        R         R 0.080191807 0.91980819
3        R         R 0.005814888 0.99418511
4        R         R 0.395159792 0.60484021
5        R         R 0.009127547 0.99087245
6        R         M 0.966860393 0.03313961
As you can see, gbmFit_MR and gbmFit_RM are different models, and therefore res_MR and res_RM contain different predictions, although both runs use the same set.seed values.
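One thing that may be worth ruling out is whether the two runs even use the same training rows. The sketch below is base R only and is not caret's actual implementation: `sample_by_level` is a hypothetical helper that draws a stratified sample by visiting the classes in level order, and the toy vectors are made up. It is only meant to show why identical seeds do not necessarily give identical stratified splits once the level order changes.

```r
# Hypothetical helper (NOT caret's code): sample a fraction p of the row
# indices within each class, visiting the classes in level order.
sample_by_level <- function(y, p) {
  picks <- lapply(levels(y), function(lev) {
    idx <- which(y == lev)
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picks))
}

# Made-up outcome: 12 "M" rows followed by 8 "R" rows.
y_MR <- factor(rep(c("M", "R"), times = c(12, 8)))  # levels: M, R
y_RM <- factor(y_MR, levels = c("R", "M"))          # same labels, levels: R, M

set.seed(998)
in_MR <- sample_by_level(y_MR, 0.75)
set.seed(998)
in_RM <- sample_by_level(y_RM, 0.75)

# Same seed, same labels, but the random draws are consumed by the
# classes in a different order, so the selected rows can differ.
identical(in_MR, in_RM)
```

In the code above, the analogous check would be `identical(inTraining_MR, inTraining_RM)`; if that returns FALSE, the two models were not trained on the same rows to begin with.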
I imagine that the order of the factor levels has an impact on the model construction, since one of them is treated as the 'positive' or 'case' class, as in the pROC package, but I couldn't find where this is mentioned in the caret documentation. Where is this documented?
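To make the "positive class" idea concrete, here is a minimal base R sketch with made-up labels. It only shows what reordering the levels changes in the factor itself. The 0/1 encoding at the end is an assumption about how a two-class learner might key its event of interest on the first level; it is not taken from caret's source.

```r
# Toy outcome vector; the labels mimic Sonar$Class but the data are made up.
y_MR <- factor(c("R", "M", "M", "R", "M"))   # default: alphabetical levels
y_RM <- factor(y_MR, levels = c("R", "M"))   # same labels, reversed level order

levels(y_MR)[1]   # "M"
levels(y_RM)[1]   # "R"

# The labels themselves are unchanged; only the integer codes differ.
identical(as.character(y_MR), as.character(y_RM))   # TRUE
as.integer(y_MR)  # 2 1 1 2 1
as.integer(y_RM)  # 1 2 2 1 2

# Assumed encoding: 1 if the observation belongs to the first level.
event_MR <- as.integer(y_MR == levels(y_MR)[1])   # 0 1 1 0 1
event_RM <- as.integer(y_RM == levels(y_RM)[1])   # 1 0 0 1 0
```

Under that assumption, the two encodings are exact complements, so anything that depends on which class is coded as the event (sensitivity versus specificity, for example) swaps roles when the levels are reordered. `relevel(y_MR, ref = "R")` is another way to move a level to the front.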