
Explanation of the Problem

I am comparing a few models, and my dataset is so small that I would much rather use cross-validation than split out a separate validation set. One of my models is fit with glm ("GLM"), another with cv.glmnet ("GLMNET"). In pseudocode, what I'd like to be able to do is the following:

initialize empty 2x2 matrices GLM_CONFUSION and GLMNET_CONFUSION

# Cross validation loop
For each data point VAL in my dataset X:
  Let TRAIN be the rest of X (not including VAL)

  Train GLM on TRAIN, use it to predict VAL
  Depending on whether it was a true positive, false positive, etc.,
    add 1 to the correct entry in GLM_CONFUSION

  Train GLMNET on TRAIN, use it to predict VAL
  Depending on whether it was a true positive, false positive, etc.,
    add 1 to the correct entry in GLMNET_CONFUSION

This is not hard to do; the problem is that cv.glmnet already uses cross-validation internally to choose the best value of the penalty lambda. It would be convenient if I could have cv.glmnet automatically build up the confusion matrix of the best model, i.e. my code should look like:

initialize empty 2x2 matrices GLM_CONFUSION and GLMNET_CONFUSION

Train GLMNET on X using cv.glmnet
Set GLMNET_CONFUSION to be the confusion matrix of lambda.1se (or lambda.min)

# Cross validation loop
For each data point VAL in my dataset X:
  Let TRAIN be the rest of X (not including VAL)

  Train GLM on TRAIN, use it to predict VAL
  Depending on whether it was a true positive, false positive, etc.,
    add 1 to the correct entry in GLM_CONFUSION
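For reference, the cv.glmnet call referred to above might look like the following sketch (the data here is simulated purely for illustration):

```r
library(glmnet)

# Simulated stand-in data, purely for illustration
set.seed(1)
x <- matrix(rnorm(100 * 5), nrow = 100)
y <- rbinom(100, 1, 0.5)

# cv.glmnet runs its own internal cross-validation over lambda
fit <- cv.glmnet(x, y, family = "binomial", nfolds = 10)

fit$lambda.min  # lambda with the minimum cross-validated error
fit$lambda.1se  # largest lambda within 1 SE of that minimum
```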

Not only would it be convenient, it is something of a necessity - there are two alternatives:

  1. Use cv.glmnet to find a new lambda.1se on TRAIN at every iteration of the cross validation loop. (i.e. a nested cross-validation)
  2. Use cv.glmnet to find lambda.1se on X, and then 'fix' that value and treat it like a normal model to train during the cross validation loop. (two parallel cross-validations)

The second one is philosophically incorrect, as it means GLMNET would have information about the points it is trying to predict in the cross-validation loop - a form of data leakage. The first would take a large chunk of time - I could in theory do it, but it might take half an hour, and I feel as if there should be a better way.
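For concreteness, a minimal sketch of the first alternative (nested cross-validation, with simulated placeholder data) could look like:

```r
library(glmnet)

# Simulated placeholder data
set.seed(1)
x <- matrix(rnorm(60 * 5), nrow = 60)
y <- rbinom(60, 1, 0.5)

# Leave-one-out outer loop; cv.glmnet re-tunes lambda on each TRAIN set
confusion <- matrix(0, nrow = 2, ncol = 2,
                    dimnames = list(pred = c(0, 1), obs = c(0, 1)))
for (i in seq_len(nrow(x))) {
  fit  <- cv.glmnet(x[-i, ], y[-i], family = "binomial")
  pred <- as.numeric(predict(fit, x[i, , drop = FALSE],
                             s = "lambda.1se", type = "class"))
  confusion[pred + 1, y[i] + 1] <- confusion[pred + 1, y[i] + 1] + 1
}
confusion
```

This is exactly the slow option: one full cv.glmnet tuning run per held-out point.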

What I've Looked At So Far

I've looked at the documentation of cv.glmnet - it does not seem like you can do what I am asking, but I am very new to R and data science in general, so it is perfectly possible that I have missed something.

I have also looked on this website and seen some posts that at first glance appeared relevant but are in fact asking for something different - for example, this post: tidy predictions and confusion matrix with glmnet

That post appears similar to what I want, but it is not quite what I am looking for: they use predict.cv.glmnet to make new predictions and then build the confusion matrix from those, whereas I want the confusion matrix of the predictions made during the cross-validation step itself.
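To illustrate the distinction (again with simulated placeholder data): the linked post's approach predicts with the final model, which was refit on the full data, so the resulting table is not the cross-validated confusion matrix I am after.

```r
library(glmnet)

# Simulated placeholder data
set.seed(1)
x <- matrix(rnorm(100 * 5), nrow = 100)
y <- rbinom(100, 1, 0.5)

fit <- cv.glmnet(x, y, family = "binomial")

# Linked post's approach: predict on the data with the fitted model,
# then tabulate -- these are in-sample predictions, not the ones made
# during cv.glmnet's internal cross-validation folds
pred <- predict(fit, newx = x, s = "lambda.1se", type = "class")
table(Predicted = pred, Actual = y)
```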

I'm hoping that someone is able to either

  1. Explain if and how it is possible to create the confusion matrix as described
  2. Show that there is a third alternative separate to the two I proposed
    • "Hand-implement cv.glmnet" is not a viable alternative :P
  3. Conclusively state that what I want is not possible and that I need to do one of the two alternatives I mentioned.

Any one of those would be a perfectly fine answer to this question (although I'm hoping for option 1!)

Apologies if there is something simple I have missed!

BaileyA

  • here is an [answer](https://stackoverflow.com/questions/62183291/statistical-test-with-test-data) to a related question that you might find helpful. In general it's best to use a [meta ML package](https://towardsdatascience.com/meta-machine-learning-packages-in-r-c3e869b53ed6) to handle tuning and evaluation of models. [`caret`](https://topepo.github.io/caret/) is probably the best-known such package in R, although it is outdated. Newer variants include [`tidymodels`](https://www.tidymodels.org/) and [`mlr3`](https://mlr3book.mlr-org.com/). I personally use mlr3 atm. – missuse Nov 09 '21 at 17:18
  • Here is a link to the mlr3 gallery: https://mlr3gallery.mlr-org.com/. Search for posts that include the tag "nested resampling". I use mlr3 because I think it's the most flexible variant of all available for R atm. It takes a bit of getting used to. If you do not plan to do this sort of thing often and do not need to tune ML pipelines, then perhaps caret is the best choice. – missuse Nov 09 '21 at 17:21
  • Thank you so much for pointing me in this direction! It was exactly what I needed :) I'll be taking a closer look at those resources over the coming days to try and become fluent in those packages. – BaileyA Nov 09 '21 at 20:42

1 Answer


Thanks to @missuse's advice, I was able to get a solution that worked for me! It corresponds to option 2 in my post, the third alternative being to use the caret package.

In essence, we need to attach a custom summary function to caret's model trainer. I mostly bumbled about for a couple of hours until I got it to work - there may be better ways to do this, and I encourage others to post alternative answers if they know of any! My code is at the bottom (it has been slightly modified to make it not specific to the task I was working on).

Hopefully if anyone has a similar problem then this will help. Another resource that I found useful in solving this was the following post: https://stats.stackexchange.com/questions/299653/caret-glmnet-vs-cv-glmnet, which shows very clearly how to convert a call to cv.glmnet into a call to caret's train with method = "glmnet".

library(caret)

# Confusion Matrix of model outputs
CM <- function(model) {
  # Need to find index of best tune found by
  # cross validation
  idx <- 1
  for (i in 1:nrow(model$results)) {
    check <- model$results[i,]
    foundBest <- TRUE
    for (col in colnames(model$bestTune)) {
      if (check[,col] != model$bestTune[,col]) {
        foundBest <- FALSE
        break
      }
    }
    if (foundBest) {
      idx <- i
      break
    }
  }
  
  # The stored results are per-fold averages (over model$control$number
  # folds), so multiply to recover the total counts
  c(
    model$results[idx,]$true_pos,
    model$results[idx,]$false_pos,
    model$results[idx,]$false_neg,
    model$results[idx,]$true_neg
  ) * model$control$number
}

# Summary function attached to the training to record confusion counts
SummaryFunc <- function (data, lev = NULL, model = NULL) { 

    # Standard caret metrics (RMSE etc.); this keeps the output
    # in the format caret expects
    out <- postResample(data$pred, data$obs)

    # The outcome is assumed to be coded 0/1; numeric predictions
    # are thresholded at 0.5 before tabulating
    pred <- factor(as.integer(data$pred > 0.5), levels = c(0, 1))
    obs  <- factor(data$obs, levels = c(0, 1))

    # Rows of the table are predictions, columns are observations
    cm <- confusionMatrix(pred, obs)$table
    
    # Append the four cell counts, treating class 1 as the positive class
    oldnames <- names(out)
    out <- c(out, cm[2, 2], cm[2, 1], cm[1, 2], cm[1, 1])
    names(out) <- c(oldnames, "true_pos", "false_pos", "false_neg", "true_neg")
    
    out
}


# 10-fold cross-validation, matching cv.glmnet's default
ctrl <- trainControl(
  method="cv",
  number=10,
  summaryFunction=SummaryFunc
)


# Example of standard glm
our.glm <- train(
  your_formula,
  data=your_data,
  method="glm",
  family=gaussian(link="identity"),
  trControl=ctrl,
  metric="RMSE"
)

# Example of what used to be cv.glmnet
our.glmnet <- train(
  your_feature_matrix,
  your_label_matrix,
  method="glmnet",
  family=gaussian(link="identity"),
  trControl=ctrl,
  metric="RMSE",
  tuneGrid = expand.grid(
    alpha = 1,
    lambda = seq(0.001, 0.1, by=0.001)
  )
)

# Total confusion counts across all folds, in the order
# true_pos, false_pos, false_neg, true_neg
CM(our.glm)
CM(our.glmnet)
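If a 2x2 matrix is preferred over the flat vector of counts, the output can be reshaped. This small helper is my own addition, assuming the count order true_pos, false_pos, false_neg, true_neg as returned above:

```r
# Hypothetical helper: reshape the four counts returned by CM into a
# 2x2 matrix with predictions as rows and observations as columns
to_matrix <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(pred = c("pos", "neg"), obs = c("pos", "neg")))
}

to_matrix(c(10, 2, 3, 15))
```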
BaileyA