
I have been using a gbm in the caret package in RStudio to find the probability of the occurrence of a failure.

I have used Youden's J to find a threshold for the best classification, which is 0.63. How do I now use this threshold? I presume the best way to do this is to somehow incorporate the threshold into the gbm model in caret to get more accurate predictions, and then rerun the model on the training data? Currently it defaults to 0.5 and I can't find an obvious way to update the threshold.

Alternatively, is the threshold just used to separate the test data predictions into the correct classes? This seems more straightforward, but how then do I reflect the change in the ROC_AUC plot, assuming the probability should be updated based on the new threshold?

Any help would be gratefully received. Thanks

EDIT: The full code I am working on is as follows:

  
library(datasets)
library(caret)
library(MLeval)
library(dplyr)

data(iris)
data <- as.data.frame(iris)

# create class
data$class <- ifelse(data$Species == "setosa", "yes", "no")

# split into train and test
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)


# Set up control function for training
ctrl <- trainControl(method = "cv",
                     number = 5, 
                     returnResamp = 'none',
                     summaryFunction = twoClassSummary,
                     classProbs = T,
                     savePredictions = T,
                     verboseIter = F)

# Set up training grid - this is based on a hyper-parameter tune that was done recently
gbmGrid <-  expand.grid(interaction.depth = 10,
                        n.trees = 20000,                                          
                        shrinkage = 0.01,                                         
                        n.minobsinnode = 4) 


# Build a standard classifier using a gradient boosted machine
set.seed(5627)
gbm_iris <- train(class ~ .,
                   data = train,
                   method = "gbm",
                   metric = "ROC",
                   tuneGrid = gbmGrid,
                   verbose = FALSE,
                   trControl = ctrl)

# Calculate best thresholds
caret::thresholder(gbm_iris, threshold = seq(.01,0.99, by = 0.01), final = TRUE, statistics = "all")

pred <- predict(gbm_iris, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))

  • how did you find the optimal threshold? – missuse Jan 22 '21 at 20:04
  • I found my optimal threshold by using caret's `thresholder` function as follows: `thres <- caret::thresholder(gbm, threshold = seq(.01,0.99, by = 0.01), final = TRUE, statistics = "all")`. From this I used Youden's J, which gave me a value of 0.63; this gave me the best FPR, but also reduced the TPR. – SB21 Jan 23 '21 at 16:04
  • Please see https://stackoverflow.com/questions/65814703/r-can-carettrain-function-for-glmnet-cross-validate-auc-at-fixed-alpha-and-la/65868243#65868243. If something is not clear please rewrite the question to include a reproducible example and I will try to answer it. – missuse Jan 24 '21 at 07:56
  • Thanks missuse, I have updated the post to include reproducible code. The link you provided was really helpful, and from this I understand that caret does not support changing the model threshold from .5. Which package could I use in R that would facilitate changing the threshold for a `gbm` model? Is changing the threshold in the model the best approach? Thanks. – SB21 Jan 25 '21 at 09:19
  • The problem with your code is that `All_train.rds` is not accessible to us at SO. Could you post a reproducible example with a built-in data set? You can change the threshold for prediction in caret by predicting probabilities and setting the threshold manually. – missuse Jan 25 '21 at 09:39
  • I have updated the code to include the iris dataset. I am new to SO and I don't know how to attach data, so I looked at other examples and they suggest using a dataset built into R. The code above works and shows that Youden's J = 1 for all thresholds. Yet, in my real dataset the highest Youden's J is for a prob_threshold of 0.63. Really I am interested in using this threshold as the cut-off to plot the ROC curve for the test data. Is there any way of doing this? – SB21 Jan 25 '21 at 18:33
  • As you can see I have used MLeval on this occasion, and there is no way of adding a threshold. I have also looked at pROC and ROCR, but I can't seem to find anything. Thanks – SB21 Jan 25 '21 at 18:39

1 Answer


There are several problems in your code. I will use the PimaIndiansDiabetes data set from mlbench since it is better suited than the iris data set.

First of all, for splitting the data into train and test sets, the code:

train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)

is not suitable, since some of the rows in the train set will also occur in the test set.
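
A quick way to see the overlap (illustrative only; it reuses the object names from the question and adds an id column purely for this check):

library(dplyr)

data <- as.data.frame(iris)
data$id <- seq_len(nrow(data))          # temporary row id, only for the check

set.seed(1)
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)

length(intersect(train$id, test$id))    # typically > 0: rows shared by train and test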

Additionally, avoid using function names as object names; it will save you a lot of headaches in the long run.

data(iris)
data <- as.data.frame(iris) #bad object name

To the example:

library(caret)
library(ModelMetrics)
library(dplyr)
library(mlbench)

data(PimaIndiansDiabetes, package = "mlbench")

Create train and test sets. You may use base R sample to sample rows, or caret::createDataPartition. createDataPartition is preferable since it tries to preserve the distribution of the response (a base R alternative is sketched below the split).

set.seed(123)
ind <- createDataPartition(PimaIndiansDiabetes$diabetes, 0.7)


tr <- PimaIndiansDiabetes[ind$Resample1,]
ts <- PimaIndiansDiabetes[-ind$Resample1,]

This way no rows in the train set will be in the test set.
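
For reference, a plain base R split (not stratified on the response) could look like the sketch below; the prop.table lines are just a quick check that the stratified split keeps the class proportions close to those of the full data:

set.seed(123)
idx <- sample(nrow(PimaIndiansDiabetes), size = floor(0.7 * nrow(PimaIndiansDiabetes)))
tr_base <- PimaIndiansDiabetes[idx,]
ts_base <- PimaIndiansDiabetes[-idx,]

prop.table(table(PimaIndiansDiabetes$diabetes))  # class proportions in the full data
prop.table(table(tr$diabetes))                   # proportions in the stratified train set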

Let's create the model:

ctrl <- trainControl(method = "cv",
                     number = 5, 
                     returnResamp = 'none',
                     summaryFunction = twoClassSummary,
                     classProbs = T,
                     savePredictions = T,
                     verboseIter = F)


gbmGrid <-  expand.grid(interaction.depth = 10,
                        n.trees = 200,                                          
                        shrinkage = 0.01,                                         
                        n.minobsinnode = 4) 

set.seed(5627)
gbm_pima <- train(diabetes ~ .,
                  data = tr,
                  method = "gbm", # xgboost ("xgbTree" in caret) is a faster alternative
                  metric = "ROC",
                  tuneGrid = gbmGrid,
                  verbose = FALSE,
                  trControl = ctrl)

Create a vector of probabilities for thresholder:

probs <- seq(.1, 0.9, by = 0.02)

ths <- thresholder(gbm_pima,
                   threshold = probs,
                   final = TRUE,
                   statistics = "all")

head(ths)

  n.trees interaction.depth shrinkage n.minobsinnode prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall        F1 Prevalence Detection Rate Detection Prevalence
1     200                10      0.01              4           0.10       1.000  0.02222222      0.6562315      1.0000000 0.6562315  1.000 0.7924209  0.6510595      0.6510595            0.9922078
2     200                10      0.01              4           0.12       1.000  0.05213675      0.6633439      1.0000000 0.6633439  1.000 0.7975413  0.6510595      0.6510595            0.9817840
3     200                10      0.01              4           0.14       0.992  0.05954416      0.6633932      0.8666667 0.6633932  0.992 0.7949393  0.6510595      0.6458647            0.9739918
4     200                10      0.01              4           0.16       0.984  0.07435897      0.6654277      0.7936508 0.6654277  0.984 0.7936383  0.6510595      0.6406699            0.9636022
5     200                10      0.01              4           0.18       0.984  0.14188034      0.6821550      0.8750000 0.6821550  0.984 0.8053941  0.6510595      0.6406699            0.9401230
6     200                10      0.01              4           0.20       0.980  0.17179487      0.6886786      0.8833333 0.6886786  0.980 0.8086204  0.6510595      0.6380725            0.9271018
  Balanced Accuracy  Accuracy      Kappa          J      Dist
1         0.5111111 0.6588517 0.02833828 0.02222222 0.9777778
2         0.5260684 0.6692755 0.06586592 0.05213675 0.9478632
3         0.5257721 0.6666781 0.06435166 0.05154416 0.9406357
4         0.5291795 0.6666781 0.07134190 0.05835897 0.9260250
5         0.5629402 0.6901572 0.15350721 0.12588034 0.8585308
6         0.5758974 0.6979836 0.18460584 0.15179487 0.8288729

Extract the threshold probability based on your preferred metric:

ths %>%
  mutate(prob = probs) %>%
  filter(J == max(J)) %>%
  pull(prob) -> thresh_prob

thresh_prob
0.74

Predict on the test data:

pred <- predict(gbm_pima, newdata = ts, type = "prob")

Create a numeric response (0 or 1) from the response in the test set, since this is needed for the functions from the ModelMetrics package:

real <- as.numeric(factor(ts$diabetes))-1

ModelMetrics::sensitivity(real, pred$pos, cutoff = thresh_prob)
0.2238806 #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::specificity(real, pred$pos, cutoff = thresh_prob)
0.956

ModelMetrics::kappa(real, pred$pos, cutoff = thresh_prob)
0.2144026  #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::mcc(real, pred$pos, cutoff = thresh_prob)
0.2776309  #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::auc(real, pred$pos)
0.8047463  #decent AUC and low mcc and kappa indicate a poor choice of threshold
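
If you also want hard class labels at the chosen cutoff (for instance to build a confusion matrix), one way is to threshold the probabilities yourself; the level names below follow the PimaIndiansDiabetes factor used above:

pred_class <- factor(ifelse(pred$pos >= thresh_prob, "pos", "neg"),
                     levels = levels(ts$diabetes))
table(predicted = pred_class, observed = ts$diabetes)
# caret::confusionMatrix(pred_class, ts$diabetes, positive = "pos") gives the same table plus extra statistics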

AUC is a measure over all thresholds, so it does not require specifying a cutoff threshold.

Since only one train/test split is used, the performance evaluation will be biased. It is best to use nested resampling so the same can be evaluated over several train/test splits. A sketch of how nested resampling could be performed here is given below.
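
A minimal nested-resampling sketch (illustrative only, reusing ctrl and gbmGrid from above): the outer folds give the honest performance estimate, while the cross-validation inside train() handles the model fitting for each outer fold.

set.seed(42)
outer_folds <- createFolds(PimaIndiansDiabetes$diabetes, k = 5)  # held-out indices per outer fold

outer_auc <- sapply(outer_folds, function(idx) {
  fit <- train(diabetes ~ .,
               data = PimaIndiansDiabetes[-idx,],
               method = "gbm",
               metric = "ROC",
               tuneGrid = gbmGrid,
               verbose = FALSE,
               trControl = ctrl)
  p <- predict(fit, newdata = PimaIndiansDiabetes[idx,], type = "prob")
  ModelMetrics::auc(as.numeric(PimaIndiansDiabetes$diabetes[idx]) - 1, p$pos)
})

mean(outer_auc)  # test AUC averaged over the outer folds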

EDIT: Answer to the questions in comments.

To create the ROC curve you do not need to calculate sensitivity and specificity at every threshold yourself; you can just use a dedicated package for such a task. The results are probably going to be more trustworthy.
I prefer using the pROC package:

library(pROC)

roc.obj <- roc(real, pred$pos)
plot(roc.obj, print.thres = "best")

(ROC curve for the test data, with the best threshold marked on the plot)

The best threshold on the figure is the threshold that gives the highest specificity + sensitivity on the test data. It is clear that this threshold (0.289) is much lower than the threshold obtained from the cross-validated predictions (0.74). This is the reason I said there will be considerable optimistic bias if you adjust the threshold on the cross-validated predictions and use the performance obtained that way as an indicator of threshold success.
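
If you prefer a number to reading it off the plot, pROC::coords returns the same point (Youden's J is its default "best" criterion):

coords(roc.obj, x = "best", ret = c("threshold", "specificity", "sensitivity"))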

In the above example not tuning the threshold would have resulted in better performance on the test set. This might hold true in general for the Pima Indians data set or this might be a case of an unfortunate train/test split. So it is best to validate this sort of thing using nested resampling.

  • Thanks missuse, this is really helpful! Just to clarify, if AUC does not require a cut-off threshold, then the ROC curve presented for the test dataset does not need to be updated based on the threshold? The reason I ask is because I have read this in the literature..... where the ROC curve is created by altering the discrimination threshold and plotting the TPR against the FPR, creating a curve. When I do this, I don't get a curve, rather two straight lines to a single point, as expected for a 1/0 class. I would like to return probs based on the updated threshold. How can I do this? – SB21 Jan 26 '21 at 20:51
  • 1
    I indicated one way to construct a ROC curve in R, with some additional explanations. – missuse Jan 27 '21 at 09:18
  • Thanks missuse, this answer is perfect for what I need! – SB21 Jan 28 '21 at 22:20
  • Glad to hear it. Please read this: https://stackoverflow.com/help/someone-answers – missuse Jan 28 '21 at 22:57