
I used caret for logistic regression in R:

  ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, 
                       savePredictions = TRUE)

  mod_fit <- train(Y ~ .,  data=df, method="glm", family="binomial",
                   trControl = ctrl)

  print(mod_fit)

The default metrics printed are accuracy and Cohen's kappa. I want to extract the matching metrics such as sensitivity, specificity, and positive predictive value, but I cannot find an easy way to do it. The final model is provided, but it is trained on all the data (as far as I can tell from the documentation), so I cannot use it to predict anew.

confusionMatrix calculates all the required parameters, but passing it as a summary function doesn't work:

  ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, 
                       savePredictions = TRUE, summaryFunction = confusionMatrix)

  mod_fit <- train(Y ~ .,  data=df, method="glm", family="binomial",
                   trControl = ctrl)

  Error: `data` and `reference` should be factors with the same levels.
  13. stop("`data` and `reference` should be factors with the same levels.", call. = FALSE)
  12. confusionMatrix.default(testOutput, lev, method)
  11. ctrl$summaryFunction(testOutput, lev, method)

Is there a way to extract this information in addition to accuracy and kappa, or to find it somewhere in the train object returned by caret::train?

Thanks in advance!


1 Answer

Caret already has summary functions to output all the metrics you mention:

defaultSummary outputs Accuracy and Kappa
twoClassSummary outputs AUC (area under the ROC curve - see last line of answer), sensitivity and specificity
prSummary outputs precision and recall
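
Each of these can also be passed to trainControl on its own; for example, a minimal sketch using just the built-in twoClassSummary (note it requires classProbs = TRUE, and metric = "ROC" in train() if you want to select the model on AUC):

# sketch: the built-in twoClassSummary used on its own
ctrl_roc <- trainControl(method = "repeatedcv",
                         number = 10,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)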

In order to get combined metrics you can write your own summary function that combines the outputs of these three:

library(caret)

MySummary <- function(data, lev = NULL, model = NULL) {
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model)
  c1 <- prSummary(data, lev, model)
  out <- c(a1, b1, c1)
  out
}

Let's try it on the Sonar data set:

library(mlbench)
data("Sonar")

When defining the train control it is important to set classProbs = TRUE, since some of these metrics (ROC and prAUC) cannot be calculated from the predicted classes alone; they require the predicted class probabilities.

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     savePredictions = TRUE,
                     summaryFunction = MySummary,
                     classProbs = TRUE)

Now fit the model of your choice:

mod_fit <- train(Class ~.,
                 data = Sonar,
                 method = "rf",
                 trControl = ctrl)

mod_fit$results
#output
  mtry  Accuracy     Kappa       ROC      Sens      Spec       AUC Precision    Recall         F AccuracySD   KappaSD
1    2 0.8364069 0.6666364 0.9454798 0.9280303 0.7333333 0.8683726 0.8121087 0.9280303 0.8621526 0.10570484 0.2162077
2   31 0.8179870 0.6307880 0.9208081 0.8840909 0.7411111 0.8450612 0.8074942 0.8840909 0.8374326 0.06076222 0.1221844
3   60 0.8034632 0.6017979 0.9049242 0.8659091 0.7311111 0.8332068 0.7966889 0.8659091 0.8229330 0.06795824 0.1369086
       ROCSD     SensSD    SpecSD      AUCSD PrecisionSD   RecallSD        FSD
1 0.04393947 0.05727927 0.1948585 0.03410854  0.12717667 0.05727927 0.08482963
2 0.04995650 0.11053858 0.1398657 0.04694993  0.09075782 0.11053858 0.05772388
3 0.04965178 0.12047598 0.1387580 0.04820979  0.08951728 0.12047598 0.06715206

In this output, ROC is in fact the area under the ROC curve (usually called AUC), and AUC is the area under the precision-recall curve across all cutoffs.
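
Additionally, since savePredictions = TRUE was set, the held-out predictions of every resample are stored in mod_fit$pred, so one option (a minimal sketch, assuming the model fitted above) is to compute the full confusionMatrix output (sensitivity, specificity, positive predictive value, and so on) on the pooled out-of-fold predictions:

# keep only the predictions made with the selected tuning parameters (best mtry)
best_preds <- merge(mod_fit$pred, mod_fit$bestTune)
# pooled cross-validated confusion matrix: sensitivity, specificity, PPV, ...
confusionMatrix(data = best_preds$pred, reference = best_preds$obs)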

  • Thanks for an excellent answer. Which of these cannot be calculated based on the predicted class and instead require the predicted probabilities? – Cindy Almighty Oct 08 '18 at 09:13
  • Glad to help. ROC and prAUC require probabilities since they are measures that take into account all possible decision thresholds and not just the 0.5 mark which is usually used. Therefore they are much better measures of model performance when you are dealing with unbalanced classes. – missuse Oct 08 '18 at 09:19
  • Perfect! In trying to fix this I got an error: Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help). I suppose this is due to my "0" and "1" factor naming for most variables. Is there an easy way around this? This is only a problem when classprobs = True – Cindy Almighty Oct 08 '18 at 09:46
  • Convert the class level names to words rather than numbers prior to training - for instance `zero` and `one` instead of `0` and `1`: `class <- ifelse(class == "0", "zero", "one")`. It should work after that. – missuse Oct 08 '18 at 10:05
  • Can you show how to write the custom function ```MySummary```? If I have ```data=htn_data```, ```Class=affirmatory``` and ```Class=negatory``` and ```model=rf_model``` how do I utilize the function? I tried ```MySummary(htn_data, htn_data$Class, rf_model)``` only to get the ```undefined columns selected``` error message. – PleaseHelp May 16 '20 at 19:44
  • @PleaseHelp the function is to be used within `caret::train` as shown in the above example. If this does not help, post another question with detailed explanation preferably with an inbuilt data set. – missuse May 16 '20 at 20:03
  • @missuse Sorry, I think I misunderstood how the example was being used the first time. Your comment helped to clarify the issue to me and I now understand how to use ```MySummary``` in ```caret```. Thanks, it's working for me now! – PleaseHelp May 16 '20 at 20:32