
When using caret's `train` function to fit GBM classification models, the internal function `predictionFunction` converts probabilistic predictions into factors using a probability threshold of 0.5:

      out <- ifelse(gbmProb >= .5, modelFit$obsLevels[1], modelFit$obsLevels[2])
      ## to correspond to gbmClasses definition above

This conversion seems premature if a user is trying to maximize the area under the ROC curve (AUROC). While sensitivity and specificity correspond to a single probability threshold (and therefore require factor predictions), I'd prefer that AUROC be calculated from the raw probability output of `gbm`'s predict method. In my experience, I've rarely cared about the calibration of a classification model; I want the most informative model possible, regardless of the probability threshold at which the model predicts a '1' vs. a '0'. Is it possible to force raw probabilities into the AUROC calculation? This seems tricky, since whatever summary function is used gets passed predictions that are already binary.


1 Answer


"since whatever summary function is used gets passed predictions that are already binary"

That's definitely not the case.

`train` does not use the class predictions to compute the ROC curve (unless you go out of your way to make it do so); see the note on `classProbs` below.

`train` can predict the classes as factors (using the internal code that you show) and/or the class probabilities.

For example, this code will compute the class probabilities and use them to get the area under the ROC curve:

library(caret)
library(mlbench)
data(Sonar)

## 10-fold CV; twoClassSummary computes the ROC AUC, sensitivity and
## specificity, and needs classProbs = TRUE to get the probabilities
ctrl <- trainControl(method = "cv",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
set.seed(1)
gbmTune <- train(Class ~ ., data = Sonar,
                 method = "gbm",
                 metric = "ROC",    ## choose the tuning parameters on AUROC
                 verbose = FALSE,
                 trControl = ctrl)
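
Once the model is fit, the `train` object gives you both kinds of predictions. A quick illustration (here just re-predicting the training set):

## factor class labels, using the 0.5 threshold
head(predict(gbmTune, newdata = Sonar, type = "raw"))
## a data frame with one probability column per class
head(predict(gbmTune, newdata = Sonar, type = "prob"))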

In fact, if you omit the `classProbs = TRUE` bit, you will get the error:

train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
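
Under the hood, the summary function is handed a data frame containing the observed classes (`obs`), the factor predictions (`pred`), and one probability column per class, so it can compute threshold-free and thresholded metrics at the same time. Here is a minimal sketch of a custom summary function along the lines of `twoClassSummary` (this pROC-based version is an illustration, not caret's exact code):

rocSummary <- function(data, lev = NULL, model = NULL) {
  ## data$obs: observed classes; data$pred: 0.5-threshold factor predictions
  ## data[, lev[1]]: probability of the first class level
  rocObj <- pROC::roc(response = data$obs,
                      predictor = data[, lev[1]],
                      levels = rev(lev),
                      direction = "<")
  c(ROC  = as.numeric(pROC::auc(rocObj)),
    Sens = caret::sensitivity(data$pred, data$obs, lev[1]),
    Spec = caret::specificity(data$pred, data$obs, lev[2]))
}
## plug in via trainControl(summaryFunction = rocSummary, classProbs = TRUE)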

Max

  • Thanks, Max. I didn't realize that both the factor predictions and the class probabilities were included in the `data` argument of the summary function, and that this allows both the _full_ AUROC and 0.5-threshold sensitivity/specificity to be calculated. – user3215964 Jan 21 '14 at 15:41
  • One other detail... the class probabilities are added as different columns (one for each class) so make sure that the classes are valid R names (e.g. not `"0"`, `"1"` etc). Also [here](http://appliedpredictivemodeling.com/blog/2013/8/15/equivocal-zones) is an example that redefines the classes based on the class probability values that may be helpful. – topepo Jan 22 '14 at 00:01
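
As a quick illustration of that last comment, `make.names` shows how R would mangle numeric class labels when they become probability column names:

make.names(c("0", "1"))
## [1] "X0" "X1"

so factors with levels like `"yes"`/`"no"` (or Sonar's `"M"`/`"R"`) avoid the problem.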