There are several problems in your code. I will use the PimaIndiansDiabetes data set from mlbench since it is better suited than the iris data set.
First of all, for splitting the data into train and test sets, the code:
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)
is not suitable, since the two calls sample independently: some rows drawn for the train set will also be drawn for the test set.
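If you want to stay with a simple random split, sampling row indices keeps the two sets disjoint. A minimal sketch with base R (tr0 and ts0 are my own names):
set.seed(1)
idx <- sample(nrow(iris), size = floor(0.7 * nrow(iris))) #indices of 70% of the rows
tr0 <- iris[idx, ]  #train set
ts0 <- iris[-idx, ] #test set: the remaining rows, no overlap with tr0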
Additionally, avoid using function names as object names; it will save you much headache in the long run.
data(iris)
data <- as.data.frame(iris) #bad object name
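Any name that does not clash with an existing function works just as well, for example (my naming):
iris_df <- as.data.frame(iris) #does not shadow utils::data()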
On to the example:
library(caret)
library(ModelMetrics)
library(dplyr)
library(mlbench)
data(PimaIndiansDiabetes, package = "mlbench")
Create the train and test sets. You may use base R sample to sample rows, or caret::createDataPartition. createDataPartition is preferable since it tries to preserve the distribution of the response.
set.seed(123)
ind <- createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.7) #p is the proportion that goes to the train set
tr <- PimaIndiansDiabetes[ind$Resample1,]
ts <- PimaIndiansDiabetes[-ind$Resample1,]
This way no rows in the train set will be in the test set.
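As a quick sanity check (my own addition), you can verify the two sets are disjoint and together cover the whole data set:
length(intersect(rownames(tr), rownames(ts))) #should be 0
nrow(tr) + nrow(ts) == nrow(PimaIndiansDiabetes) #should be TRUE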
Let's create the model:
ctrl <- trainControl(method = "cv",
                     number = 5,
                     returnResamp = "none",
                     summaryFunction = twoClassSummary, #computes ROC, Sens and Spec
                     classProbs = TRUE,                 #class probabilities are needed for ROC and thresholder
                     savePredictions = TRUE,            #saved predictions are needed by thresholder
                     verboseIter = FALSE)
gbmGrid <- expand.grid(interaction.depth = 10,
n.trees = 200,
shrinkage = 0.01,
n.minobsinnode = 4)
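Note that this grid holds a single parameter combination, so nothing is actually tuned. If you do want to tune, pass vectors of candidate values; the values below are arbitrary examples, not a recommendation:
gbm_grid_tune <- expand.grid(interaction.depth = c(3, 5, 10),
                             n.trees = c(100, 200, 500),
                             shrinkage = c(0.01, 0.1),
                             n.minobsinnode = c(4, 10)) #3*3*2*2 = 36 combinations evaluated by CV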
set.seed(5627)
gbm_pima <- train(diabetes ~ .,
data = tr,
method = "gbm", #use xgboost
metric = "ROC",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
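To inspect the cross-validated performance of the fitted model (optional, my addition):
gbm_pima               #prints the resampling summary
getTrainPerf(gbm_pima) #cross-validated ROC, Sens and Spec for the chosen parameters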
Create a vector of probability cutoffs for thresholder:
probs <- seq(.1, 0.9, by = 0.02)
ths <- thresholder(gbm_pima,
threshold = probs,
final = TRUE,
statistics = "all")
head(ths)
n.trees interaction.depth shrinkage n.minobsinnode prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence
1 200 10 0.01 4 0.10 1.000 0.02222222 0.6562315 1.0000000 0.6562315 1.000 0.7924209 0.6510595 0.6510595 0.9922078
2 200 10 0.01 4 0.12 1.000 0.05213675 0.6633439 1.0000000 0.6633439 1.000 0.7975413 0.6510595 0.6510595 0.9817840
3 200 10 0.01 4 0.14 0.992 0.05954416 0.6633932 0.8666667 0.6633932 0.992 0.7949393 0.6510595 0.6458647 0.9739918
4 200 10 0.01 4 0.16 0.984 0.07435897 0.6654277 0.7936508 0.6654277 0.984 0.7936383 0.6510595 0.6406699 0.9636022
5 200 10 0.01 4 0.18 0.984 0.14188034 0.6821550 0.8750000 0.6821550 0.984 0.8053941 0.6510595 0.6406699 0.9401230
6 200 10 0.01 4 0.20 0.980 0.17179487 0.6886786 0.8833333 0.6886786 0.980 0.8086204 0.6510595 0.6380725 0.9271018
Balanced Accuracy Accuracy Kappa J Dist
1 0.5111111 0.6588517 0.02833828 0.02222222 0.9777778
2 0.5260684 0.6692755 0.06586592 0.05213675 0.9478632
3 0.5257721 0.6666781 0.06435166 0.05154416 0.9406357
4 0.5291795 0.6666781 0.07134190 0.05835897 0.9260250
5 0.5629402 0.6901572 0.15350721 0.12588034 0.8585308
6 0.5758974 0.6979836 0.18460584 0.15179487 0.8288729
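To see how the metric you plan to optimize behaves across the cutoffs, a quick plot helps (my addition; thresholder already returns the cutoff in the prob_threshold column):
library(ggplot2)
ggplot(ths, aes(x = prob_threshold, y = J)) +
  geom_line() +
  geom_vline(xintercept = ths$prob_threshold[which.max(ths$J)], linetype = 2) #dashed line at the best J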
Extract the threshold probability based on your preferred metric (here Youden's J):
ths %>%
mutate(prob = probs) %>%
filter(J == max(J)) %>%
pull(prob) -> thresh_prob
thresh_prob
0.74
Predict on the test data:
pred <- predict(gbm_pima, newdata = ts, type = "prob")
Create a numeric response (0 or 1) from the response in the test set, since the functions from package ModelMetrics require it:
real <- as.numeric(factor(ts$diabetes))-1
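A quick check (my addition) that the 0/1 coding lines up with the factor levels (0 = "neg", 1 = "pos"):
table(real, ts$diabetes)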
ModelMetrics::sensitivity(real, pred$pos, cutoff = thresh_prob)
0.2238806 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::specificity(real, pred$pos, cutoff = thresh_prob)
0.956
ModelMetrics::kappa(real, pred$pos, cutoff = thresh_prob)
0.2144026 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::mcc(real, pred$pos, cutoff = thresh_prob)
0.2776309 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::auc(real, pred$pos)
0.8047463 #decent AUC and low mcc and kappa indicate a poor choice of threshold
AUC is a measure over all thresholds, so it does not require specification of a cutoff threshold.
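If you also want the full confusion matrix behind the threshold-based metrics above, caret::confusionMatrix can provide it (a sketch; pred_class is my own name):
pred_class <- factor(ifelse(pred$pos > thresh_prob, "pos", "neg"),
                     levels = levels(ts$diabetes)) #hard class labels at the chosen cutoff
caret::confusionMatrix(pred_class, ts$diabetes, positive = "pos")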
Since only one train/test split is used, the performance evaluation will be biased. It is best to use nested resampling, so the same can be evaluated over several train/test splits. Here is a way to perform nested resampling.
EDIT: Answer to the questions in comments.
To create the ROC curve you do not need to calculate sensitivity and specificity at all thresholds yourself; you can just use a dedicated package for such a task. The results are probably also going to be more trustworthy.
I prefer using the pROC package:
library(pROC)
roc.obj <- roc(real, pred$pos)
plot(roc.obj, print.thres = "best")
[Figure: ROC curve of the test-set predictions, with the best threshold (0.289) printed on the curve]
The best threshold in the figure is the threshold that gives the highest sensitivity + specificity on the test data. It is clear that this threshold (0.289) is much lower than the threshold obtained from the cross-validated predictions (0.74). This is the reason I said there will be considerable optimistic bias if you tune the threshold on the cross-validated predictions and use the performance obtained that way as an indicator of threshold success.
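To read those numbers off programmatically rather than from the plot, pROC::coords can be used (a sketch):
coords(roc.obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))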
In the above example, not tuning the threshold would have resulted in better performance on the test set. This might hold true in general for the Pima Indians data set, or it might be a case of an unfortunate train/test split. Either way, it is best to validate this sort of thing using nested resampling.
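A rough sketch of such an outer loop (my own, simplified; the inner 5-fold CV is the one already defined in ctrl), just to illustrate the idea:
set.seed(42)
outer_folds <- createMultiFolds(PimaIndiansDiabetes$diabetes, k = 5, times = 1) #5 outer train/test splits
outer_auc <- sapply(outer_folds, function(idx) {
  tr_o <- PimaIndiansDiabetes[idx, ]
  ts_o <- PimaIndiansDiabetes[-idx, ]
  fit <- train(diabetes ~ ., data = tr_o, method = "gbm", metric = "ROC",
               tuneGrid = gbmGrid, verbose = FALSE, trControl = ctrl) #inner CV
  p <- predict(fit, newdata = ts_o, type = "prob")
  ModelMetrics::auc(as.numeric(ts_o$diabetes) - 1, p$pos) #performance on the outer test fold
})
outer_auc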