
My problem

I would like to build a logistic regression model with a high AUC for predicting a binary outcome.

I would like to use the following approach (if feasible):

  1. Use an elastic net model (glmnet) to reduce the predictors and find the best hyperparameter (alpha and lambda)

  2. Combine the output of this model (a simple linear combination) with an additional predictor (the opinion of a super doctor, superdoc) in a logistic regression model (= finalmodel), similar to what is described on page 26 of:

Afshar P, Mohammadi A, Plataniotis KN, Oikonomou A, Benali H. From Handcrafted to Deep-Learning-Based Cancer Radiomics: Challenges and Opportunities. IEEE Signal Process Mag 2019; 36: 132–60.

Example data

As example data I have a dataset with many numeric predictors and a binary (pos/neg) outcome (diabetes).

# libraries
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
library(broom)  # for tidy()
library(pROC)   # for roc()

# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes

# add the super doctor's opinion to the data
set.seed(2323)
data %>% 
  rowwise() %>% 
  mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)), TRUE~ 0)) -> data

# separate the data in a training set and test set
train.data <- data[1:550,]
test.data <- data[551:768,]

Created on 2021-03-14 by the reprex package (v1.0.0)

What I already tried

# train the model (without the superdoc's opinion)
set.seed(2323)
model <- train(
  diabetes ~., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric="ROC" #ROC metric is in twoClassSummary
)


# extract the coefficients for the best alpha and lambda  
coef(model$finalModel, model$finalModel$lambdaOpt) -> coeffs
tidy(coeffs) %>% tibble() -> coeffs

coef.interc = coeffs %>% filter(row=="(Intercept)") %>% pull(value)
coef.pregnant = coeffs %>% filter(row=="pregnant") %>% pull(value)
coef.glucose = coeffs %>% filter(row=="glucose") %>% pull(value)
coef.pressure = coeffs %>% filter(row=="pressure") %>% pull(value)
coef.mass = coeffs %>% filter(row=="mass") %>% pull(value)
coef.pedigree = coeffs %>% filter(row=="pedigree") %>% pull(value)
coef.age = coeffs %>% filter(row=="age") %>% pull(value)


# combine the model with the superdoc's opinion in a logistic regression model
finalmodel = glm(diabetes ~ superdoc +
                   I(coef.interc + coef.pregnant*pregnant + coef.glucose*glucose +
                       coef.pressure*pressure + coef.mass*mass + coef.pedigree*pedigree +
                       coef.age*age),
                 family = binomial, data = train.data)


# make predictions on the test data
predict(finalmodel,test.data, type="response") -> predictions


# check the AUC of the model in the test data
roc(test.data$diabetes,predictions, ci=TRUE) 
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.default(response = test.data$diabetes, predictor = predictions,     ci = TRUE)
#> 
#> Data: predictions in 145 controls (test.data$diabetes neg) < 73 cases (test.data$diabetes pos).
#> Area under the curve: 0.9345
#> 95% CI: 0.8969-0.9721 (DeLong)

Created on 2021-03-14 by the reprex package (v1.0.0)

Where I am not really sure...

I think that to find the most accurate model and to avoid overfitting, I have to use nested cross-validation (as I learned here and here). However, I am not sure how to do so. At the moment, every time I use another set.seed, different predictors get selected and I get different AUCs. Can this be mitigated with proper use of nested cross-validation?
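From what I have read, nested CV with caret would look roughly like the sketch below (my own rough attempt, using the libraries and train.data from above; the number of outer folds is an arbitrary choice and I am not sure this is the correct way to do it):

# outer loop: estimate the performance of the whole tuning pipeline
# inner loop: caret::train tunes alpha and lambda via 10-fold CV
set.seed(2323)
outer_folds <- createFolds(train.data$diabetes, k = 5, returnTrain = TRUE)

outer_auc <- sapply(outer_folds, function(idx) {
  outer_train <- train.data[idx, ]
  outer_test  <- train.data[-idx, ]

  # tune alpha/lambda on the outer training portion only
  fit <- train(
    diabetes ~ ., data = outer_train %>% select(-superdoc), method = "glmnet",
    trControl = trainControl("cv", number = 10, classProbs = TRUE,
                             summaryFunction = twoClassSummary),
    tuneLength = 10,
    metric = "ROC"
  )

  # evaluate the tuned model on the outer held-out fold
  probs <- predict(fit, outer_test, type = "prob")[, "pos"]
  as.numeric(auc(outer_test$diabetes, probs))
})

outer_auc        # AUC per outer fold
mean(outer_auc)  # estimate of the pipeline's performance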

Update 1

I just learned that nested CV does not help you get the most accurate model. The problem is that I get varying coefficients with different set.seed values in the second code sample above. I actually have the same problem as described here: Extract the coefficients for the best tuning parameters of a glmnet model in caret

One posted solution is to use repeated CV to mitigate this variation. Unfortunately, I could not get this to run.

Update 2

Using "repeatedcv" solved my problem. Using repeated cv not nested cv did the trick!

model <- train(
  diabetes ~., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("repeatedcv",
                           number = 10,
                           repeats=10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric="ROC" #ROC metric is in twoClassSummary
)
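
To convince myself that the coefficients are now stable, a rough check (my own addition, not strictly needed) is to refit with two different seeds and compare the non-zero coefficients at the optimal lambda:

# rough stability check: refit the repeatedcv model with different seeds
fit_with_seed <- function(seed) {
  set.seed(seed)
  m <- train(
    diabetes ~ ., data = train.data %>% select(-superdoc), method = "glmnet",
    trControl = trainControl("repeatedcv", number = 10, repeats = 10,
                             classProbs = TRUE, summaryFunction = twoClassSummary),
    tuneLength = 10,
    metric = "ROC"
  )
  coef(m$finalModel, m$finalModel$lambdaOpt)
}

fit_with_seed(1)
fit_with_seed(2)  # the selected predictors now change only marginally between seeds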
  • Nested CV is used to measure the performance of the modeling approach, i.e. to see how differences in the train set impact the ML pipeline's hyperparameters and what variance in performance can be expected. After running nested CV you need to run the modeling pipeline on the whole train set to get one model which you will use for predicting. Nested CV does not help you get the most accurate model, it just helps you evaluate models unbiasedly. – missuse Mar 14 '21 at 07:31
  • I have exactly this problem if I run the second code chunk with different `set.seed`: https://stackoverflow.com/questions/48079660/extract-the-coefficients-for-the-best-tuning-parameters-of-a-glmnet-model-in-car – ava Mar 14 '21 at 12:16
  • Try using repeated CV to drive down the variation, just as the creator of caret suggests in the comments of my answer. – missuse Mar 14 '21 at 14:05
  • @missuse Just for my understanding, is there no risk of overfitting because I use the whole train set for the elastic net model and then I use the same train set to build the nested logistic regression? – ava Mar 14 '21 at 15:04
  • Usually stacked learners are built using out-of-fold predictions from lower-level learners. Check this blog post - https://mlr3gallery.mlr-org.com/posts/2020-04-27-tuning-stacking/ and this book chapter: https://mlr3book.mlr-org.com/pipe-nonlinear.html. That being said, I am not familiar with the paper by Afshar et al. you reference, nor do I have time atm to check it out. – missuse Mar 14 '21 at 16:15
  • Marvellous! Will definitely have a look at it. – ava Mar 14 '21 at 17:38

1 Answer


Thanks to @missuse, I was able to resolve my questions:

Cross-validation does not help to get the most accurate model. This misunderstanding (mine, in this case) is discussed beautifully in the blog post: The "Cross-Validation - Train/Predict" misunderstanding

The seed-dependent variation of glmnet's predictor coefficients in small datasets can be mitigated with repeated cross-validation (i.e., "repeatedcv" in caret::trainControl, as described in the comments here).

Stacked learners (in my case stacked glmnet and glm) are usually built using out-of-fold predictions from lower-level learners. This could be done using the mlr3 package as described in this blog post: Tuning a stacked learner. Since this was not part of my initial question, I opened a new question here.
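
For completeness, here is a minimal caret-based sketch of the out-of-fold idea (my own illustration, not the mlr3 pipeline from the linked posts; it reuses the model fitted with savePredictions = TRUE in Update 2, and the column name en_prob is just made up for illustration):

# out-of-fold probabilities of the level-0 learner (elastic net) for the best alpha/lambda
oof <- model$pred %>%
  inner_join(model$bestTune, by = c("alpha", "lambda")) %>%   # keep predictions of the winning tune
  group_by(rowIndex) %>%
  summarise(en_prob = mean(pos), .groups = "drop") %>%        # average over the CV repeats
  arrange(rowIndex)

stack_train <- train.data %>% mutate(en_prob = oof$en_prob)

# level-1 learner: combine the out-of-fold elastic-net probability with superdoc
stacked <- glm(diabetes ~ superdoc + en_prob, family = binomial, data = stack_train)

# at prediction time, apply the level-0 model to the new data first
test_probs  <- predict(model, test.data, type = "prob")[, "pos"]
predictions <- predict(stacked,
                       newdata = test.data %>% mutate(en_prob = test_probs),
                       type = "response")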
