You intend to use "ROC" (the area under the ROC curve) to pick the best tuning parameters, but you do not specify summaryFunction = twoClassSummary, which is the summary function that computes this metric. This is what the warning is telling you:
Warning message:
In train.default(x, y, weights = w, ...) :
The metric "ROC" was not in the result set. Accuracy will be used instead.
Perform the tuning:
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
data(PimaIndiansDiabetes, package="mlbench")
set.seed(2323)
train.data <- PimaIndiansDiabetes
set.seed(2323)
model <- train(
  diabetes ~ ., data = train.data, method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric = "ROC" # the ROC metric comes from twoClassSummary
)
Since you specified classProbs = TRUE
and savePredictions = TRUE
you can calculate any metric based on the saved predictions.
To calculate accuracy:
model$pred %>%
  filter(alpha == model$bestTune$alpha,   # keep predictions for the best tuning parameters
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>%                  # group by fold
  summarise(acc = sum(pred == obs) / n()) # calculate the metric
#output
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 10 x 2
Resample acc
<chr> <dbl>
1 Fold01 0.740
2 Fold02 0.753
3 Fold03 0.818
4 Fold04 0.776
5 Fold05 0.779
6 Fold06 0.753
7 Fold07 0.766
8 Fold08 0.792
9 Fold09 0.727
10 Fold10 0.789
This gives you the per-fold metric. To get the average performance across folds:
model$pred %>%
  filter(alpha == model$bestTune$alpha,
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>%
  summarise(acc = sum(pred == obs) / n()) %>%
  pull(acc) %>%
  mean()
#output
0.769566
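As another example (a minimal sketch of my own, not part of the original answer), the same saved predictions give you per-fold Cohen's Kappa via caret::postResample():
# caret and tidyverse are already loaded above
model$pred %>%
  filter(alpha == model$bestTune$alpha,
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>%
  summarise(kappa = postResample(pred, obs)[["Kappa"]]) %>% # Kappa per fold
  pull(kappa) %>%
  mean()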
When ROC is used as the selection metric, the hyperparameters are optimized over all decision thresholds. In many cases the chosen model will perform suboptimally at the default decision threshold of 0.5.
caret has a function thresholder()
which calculates many metrics on the resampled data over the specified decision thresholds.
thresholder(model, seq(0, 1, length.out = 10)) #in reality I would use length.out = 100
#output
alpha lambda prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy Accuracy
1 0.1 0.03607775 0.0000000 1.000 0.00000000 0.6510595 NaN 0.6510595 1.000 0.7886514 0.6510595 0.6510595 1.0000000 0.5000000 0.6510595
2 0.1 0.03607775 0.1111111 0.994 0.02621083 0.6557464 0.7380952 0.6557464 0.994 0.7901580 0.6510595 0.6471463 0.9869617 0.5101054 0.6562714
3 0.1 0.03607775 0.2222222 0.986 0.15270655 0.6850874 0.8711111 0.6850874 0.986 0.8082906 0.6510595 0.6419344 0.9375256 0.5693533 0.6952837
4 0.1 0.03607775 0.3333333 0.964 0.32421652 0.7278778 0.8406807 0.7278778 0.964 0.8290127 0.6510595 0.6276316 0.8633459 0.6441083 0.7408578
5 0.1 0.03607775 0.4444444 0.928 0.47364672 0.7674158 0.7903159 0.7674158 0.928 0.8395895 0.6510595 0.6041866 0.7877990 0.7008234 0.7695147
6 0.1 0.03607775 0.5555556 0.862 0.59002849 0.7970454 0.7053968 0.7970454 0.862 0.8274687 0.6510595 0.5611928 0.7043575 0.7260142 0.7669686
7 0.1 0.03607775 0.6666667 0.742 0.75740741 0.8521972 0.6114289 0.8521972 0.742 0.7926993 0.6510595 0.4830827 0.5677204 0.7497037 0.7473855
8 0.1 0.03607775 0.7777778 0.536 0.90284900 0.9156149 0.5113452 0.9156149 0.536 0.6739140 0.6510595 0.3489918 0.3828606 0.7194245 0.6640636
9 0.1 0.03607775 0.8888889 0.198 0.98119658 0.9573810 0.3967404 0.9573810 0.198 0.3231917 0.6510595 0.1289474 0.1354751 0.5895983 0.4713602
10 0.1 0.03607775 1.0000000 0.000 1.00000000 NaN 0.3489405 NaN 0.000 NaN 0.6510595 0.0000000 0.0000000 0.5000000 0.3489405
Kappa J Dist
1 0.0000000 0.00000000 1.0000000
2 0.0258717 0.02021083 0.9738516
3 0.1699809 0.13870655 0.8475624
4 0.3337322 0.28821652 0.6774055
5 0.4417759 0.40164672 0.5329805
6 0.4692998 0.45202849 0.4363768
7 0.4727251 0.49940741 0.3580090
8 0.3726156 0.43884900 0.4785352
9 0.1342372 0.17919658 0.8026597
10 0.0000000 0.00000000 1.0000000
Now pick a threshold based on your desired metric and use that. The metrics usually used with imbalanced data are Cohen's Kappa, Youden's J, or the Matthews correlation coefficient (MCC). Here is a decent paper on the matter.
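For illustration, here is a minimal sketch (my own addition, not from the original output) of picking the threshold that maximizes Youden's J; swap J for Kappa or any other column of the thresholder() output you prefer:
ths <- thresholder(model, seq(0, 1, length.out = 100))

best_thr <- ths$prob_threshold[which.max(ths$J)] # threshold with the largest Youden's J
best_thr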
Please note that since this data was used to find the optimal threshold, the performance obtained this way will be optimistically biased. To evaluate the performance of the picked decision threshold, it would be best to use several independent test sets. In other words, I recommend nested resampling, where you optimize the parameters and the threshold on the inner folds and evaluate on the outer folds.
Here is an explanation of how to use nested resampling with caret for regression. Some modifications are needed to make it work for classification with an optimized threshold.
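As a rough sketch of the idea (my own, not the linked post's code, and written under my assumption that thresholder() applies the cutoff to the probability of the first factor level, "neg" for this data, which is consistent with the Prevalence of 0.65 in the output above), the outer loop could look like this:
set.seed(2323)
outer_folds <- createFolds(train.data$diabetes, k = 5) # outer test-fold indices

outer_acc <- sapply(outer_folds, function(idx) {
  inner_train <- train.data[-idx, ]
  outer_test  <- train.data[idx, ]

  # tune alpha/lambda on the inner folds, selecting by ROC as before
  fit <- train(
    diabetes ~ ., data = inner_train, method = "glmnet",
    trControl = trainControl("cv",
                             number = 10,
                             classProbs = TRUE,
                             savePredictions = TRUE,
                             summaryFunction = twoClassSummary),
    tuneLength = 10,
    metric = "ROC"
  )

  # pick the decision threshold on the inner resamples (here: max Youden's J)
  ths <- thresholder(fit, seq(0.1, 0.9, length.out = 81))
  thr <- ths$prob_threshold[which.max(ths$J)]

  # evaluate the tuned model + threshold on the untouched outer fold
  p_neg <- predict(fit, outer_test, type = "prob")[["neg"]]
  pred  <- ifelse(p_neg >= thr, "neg", "pos")
  mean(pred == outer_test$diabetes) # outer-fold accuracy (use your metric of choice)
})

mean(outer_acc)
The mean over the outer folds then estimates the performance of the whole procedure, including the threshold choice, on data that was never used for tuning.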
Please note that this is not the only way to pick the best decision threshold. Another way is to pick the desired metric a priori (MCC, for instance) and treat the decision threshold as a hyperparameter to be tuned jointly with all the other hyperparameters. I believe this is not supported in caret without creating a custom model.