
I'm new to cross-validation. To understand the process, I ran 10-fold cross-validation with 10 repeats, using the caret package, on my data with logistic regression and two predictors:

# creating hypothetical data
x1 = rnorm(47, mean = 45, sd = 5); x2 = rnorm(107, mean = 50, sd = 3)
z1 = rnorm(47, mean = 13, sd = 1.3); z2 = rnorm(107, mean = 15, sd = 2.3)
Y0 = rep("No", 107); Y1 = rep("Yes", 47)
a = c(x2, x1); b = c(z2, z1); c = c(Y0, Y1)
# data.frame() keeps Class1/Class2 numeric; cbind() would coerce them to character
data = data.frame(Group = factor(c), Class1 = a, Class2 = b)

library(caret);library(pROC)
set.seed(1); train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                          summaryFunction = twoClassSummary,
                          savePredictions = "all", classProbs = TRUE)
set.seed(1); model <- train(Group ~ Class1 + Class2, data = data, method = "glm",
             family = "binomial", trControl = train.control)

Learning from a previous post, I figured the following script would produce an "overall" cross-validated ROC/AUC plot:

plot(roc(response = model$pred$obs, predictor = model$pred$Yes))

But this plots the ROC built from the concatenated observed/predicted values of all 10 folds × 10 repeats, which is not what I want. Since the output AUC is an average of the per-iteration AUCs (the same as mean(model$resample$ROC)), I want to plot this average AUC.
Looking at the output of model$pred, I understand that the first 15 or so rows are the first fold of a specific repeat (as seen in model$pred$Resample), the next 15 or so rows are the second fold, and so on.
I want to segment the vectors of predicted and observed values by each Fold.Rep combination, calculate the mean at each point, and then plot the average ROC (a cross-validated ROC/AUC). I get it conceptually, but I can't come up with the specific code, especially because the folds differ in length.
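For example, counting rows per resample confirms this (fold sizes differ because 154 rows don't divide evenly into 10 folds):

table(model$pred$Resample)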

I also want to obtain a 95% CI for the AUC, repeat the same process with a different predictor, and overlay the two cross-validated average ROC curves to visually show the difference. Can someone solve this, please? Thank you!

MAMY
  • Please provide a https://stackoverflow.com/help/minimal-reproducible-example. We can't reproduce your plot without `data`. And while one can guess that your code needs `library(caret)` to work, it is unclear which `roc` function you use (from which package). – scrameri Sep 19 '21 at 20:12
  • @scrameri I'm sorry, I can't provide the data. Thank you for your comments, though. I don't have any issue with the script (roc is from pROC and all necessary packages are installed). My question is: given the output of caret (model$pred), how do I write code to extract/segment the predicted (model$pred$CLASSNAME) and observed values (model$pred$obs) for each fold-repeat (model$pred$Resample), average them, and plot the result? Because there are 10 × 10 validation sets, it takes forever without a function, but the length of each validation set differs, so I'm stuck. – MAMY Sep 19 '21 at 22:47
  • @scrameri, I just created sample data so the script would run, and edited the post. Thanks for your advice. – MAMY Sep 19 '21 at 23:29

1 Answer


If I understand the first part of your question correctly, you'd like to extract some statistics from each resample, and summarise them in some way.

To achieve this, you could sapply a procedure FUN over all elements X, where X are your resamples. This way, it doesn't matter how many samples are predicted in each resample.

However, I'm not exactly sure which statistics you want to extract from the resamples (the mean at each point?). In the following, I assume that you'd like to get the ROC curve for each resample.

apply cross-validation to logistic regression model

library(caret)
library(pROC)
library(tidyverse)

train.control <- trainControl(method="repeatedcv", number=10, repeats = 10,
                              summaryFunction = twoClassSummary, 
                              savePredictions = "all",classProbs = TRUE)                

set.seed(1)
model <- train(Group ~ Class1 + Class2, data = data, method = "glm",
               family = "binomial", trControl = train.control)
               
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.

per-resample ROC (receiver operating characteristic) curve

Such an sapply solution should work regardless of how large each resample is, and for whatever you'd like to do in FUN, as long as FUN outputs a data.frame.

# sapply roc(), bind as tibble with Resample as .id
dd.roc <- sapply(X = unique(model$pred$Resample),
                 FUN = function(x) {
                   r <- model$pred[model$pred$Resample == x,]
                   R <- roc(response = r$obs, predictor = r$Yes)
                   data.frame(sensitivities = R$sensitivities,
                              specificities = R$specificities)
                 }, simplify = FALSE) %>%
  bind_rows(.id = "Resample") %>%
  as_tibble() %>%
  arrange(specificities)
#> Setting levels: control = No, case = Yes
#> Setting direction: controls < cases
#> [many more such warnings]


dd.roc
#> # A tibble: 1,640 × 3
#>    Resample     sensitivities specificities
#>    <chr>                <dbl>         <dbl>
#>  1 Fold01.Rep01             1             0
#>  2 Fold02.Rep01             1             0
#>  3 Fold03.Rep01             1             0
#>  4 Fold04.Rep01             1             0
#>  5 Fold05.Rep01             1             0
#>  6 Fold06.Rep01             1             0
#>  7 Fold07.Rep01             1             0
#>  8 Fold08.Rep01             1             0
#>  9 Fold09.Rep01             1             0
#> 10 Fold10.Rep01             1             0
#> # … with 1,630 more rows
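
If it's the average AUC itself that you're after (rather than the curves), the same pattern applies: compute one AUC per resample, then average. A minimal sketch along the lines of the code above; the mean should match mean(model$resample$ROC) reported by caret:

# one AUC per resample; the mean should equal mean(model$resample$ROC)
auc.per.resample <- sapply(X = unique(model$pred$Resample),
                           FUN = function(x) {
                             r <- model$pred[model$pred$Resample == x, ]
                             as.numeric(roc(response = r$obs, predictor = r$Yes)$auc)
                           })
mean(auc.per.resample)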

get overall ROC (receiver operating characteristic) curve

d.roc <- roc(response = model$pred$obs, predictor = model$pred$Yes)
#> Setting levels: control = No, case = Yes
#> Setting direction: controls < cases

d.roc <- data.frame(sensitivities = d.roc$sensitivities,
                    specificities = d.roc$specificities)

plot

ggplot(dd.roc, aes(x = specificities, y = sensitivities)) +
  # geom_point(colour = "tomato", alpha = 0.1) +
  # geom_density2d() +
  geom_path(aes(group = Resample, colour = "ROC per resample"), alpha = 0.1) +
  geom_smooth(colour = "tomato", size = 1.5) +
  geom_line(data = d.roc, aes(colour = "ROC over all resamples"), size = 1.5) +
  ylim(0,1) +
  geom_abline(aes(slope = 1, intercept = 1)) +
  scale_x_reverse(limit = c(1,0)) +
  scale_colour_manual(values = c("seagreen","tomato"), name = "") +
  theme_classic() +
  theme(legend.position = "bottom")
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
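
95% CI of the AUC

You also asked for a 95% CI of the AUC. Two sketches, which answer slightly different questions: quantiles of the per-resample AUCs (auc.per.resample from above) describe the spread across resamples, while pROC's ci.auc() gives a DeLong CI for the pooled-predictions curve:

# spread of the cross-validated AUCs (quantile-based interval)
quantile(auc.per.resample, probs = c(0.025, 0.975))

# DeLong 95% CI of the AUC from the pooled predictions (pROC)
ci.auc(roc(response = model$pred$obs, predictor = model$pred$Yes))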

toy data

x1 = rnorm(47, mean = 45, sd = 5); x2 = rnorm(107, mean = 50, sd = 3); 
z1 = rnorm(47, mean = 13, sd = 1.3); z2 = rnorm(107, mean = 15, sd = 2.3)
Y0 = rep("No", 107);Y1 = rep("Yes",47)
a = c(x2,x1); b = c(z2,z1); c = c(Y0,Y1); 
data = data.frame(Group = factor(c), Class1 = a, Class2 = b)

# plot data
ggplot(data, aes(x = Class1, y = Class2, colour = Group)) +
  geom_point() +
  theme_bw()

Created on 2021-09-21 by the reprex package (v2.0.1)
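
comparing two models

For the last part of your question (overlaying two cross-validated average ROC curves), one possible approach is sketched below. It assumes a hypothetical second model (here, Class2 as the sole predictor, purely for illustration), trained the same way with savePredictions = "all":

# hypothetical second model (illustration only): Class2 as the sole predictor
set.seed(1)
model2 <- train(Group ~ Class2, data = data, method = "glm",
                family = "binomial", trControl = train.control)

# per-resample ROC points for a caret model trained with savePredictions = "all"
roc_points <- function(m) {
  sapply(X = unique(m$pred$Resample),
         FUN = function(x) {
           r <- m$pred[m$pred$Resample == x, ]
           R <- roc(response = r$obs, predictor = r$Yes)
           data.frame(sensitivities = R$sensitivities,
                      specificities = R$specificities)
         }, simplify = FALSE) %>%
    bind_rows(.id = "Resample")
}

# overlay the two smoothed (cross-validated average) ROC curves
bind_rows(list(`Class1 + Class2` = roc_points(model),
               `Class2 only` = roc_points(model2)), .id = "Model") %>%
  ggplot(aes(x = specificities, y = sensitivities, colour = Model)) +
  geom_smooth(size = 1.5) +
  geom_abline(aes(slope = 1, intercept = 1)) +
  scale_x_reverse(limit = c(1, 0)) +
  ylim(0, 1) +
  theme_classic() +
  theme(legend.position = "bottom")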

scrameri