Explanation of the Problem
I am comparing a few models, and my dataset is so small that I would much rather use cross-validation than split out a separate validation set. One of my models is fit with `glm` ("GLM"), another with `cv.glmnet` ("GLMNET"). In pseudocode, what I'd like to be able to do is the following:
```
initialize empty 2x2 matrices GLM_CONFUSION and GLMNET_CONFUSION

# Cross-validation loop
For each data point VAL in my dataset X:
    Let TRAIN be the rest of X (not including VAL)

    Train GLM on TRAIN, use it to predict VAL
    Depending on if it was a true positive, false positive, etc.,
        add 1 to the correct entry in GLM_CONFUSION

    Train GLMNET on TRAIN, use it to predict VAL
    Depending on if it was a true positive, false positive, etc.,
        add 1 to the correct entry in GLMNET_CONFUSION
```
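Concretely, the GLM half of that loop (the easy half) might look like this in R. This is only a sketch: I'm assuming `X` is a data frame with a binary 0/1 response column named `y`, and all the variable names are mine.

```r
# Leave-one-out CV for the GLM model. Assumes X is a data frame with
# a binary 0/1 response column named y (hypothetical names).
GLM_CONFUSION <- matrix(0, 2, 2,
                        dimnames = list(predicted = c("0", "1"),
                                        actual    = c("0", "1")))

for (i in seq_len(nrow(X))) {
  train <- X[-i, ]
  val   <- X[i, , drop = FALSE]

  fit  <- glm(y ~ ., data = train, family = binomial)
  prob <- predict(fit, newdata = val, type = "response")
  pred <- as.integer(prob > 0.5)  # classify at a 0.5 threshold

  # Increment the (predicted, actual) cell: TP, FP, FN, or TN
  GLM_CONFUSION[pred + 1, val$y + 1] <- GLM_CONFUSION[pred + 1, val$y + 1] + 1
}
```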
This is not hard to do; the problem is that `cv.glmnet` already uses cross-validation internally to choose the best value of the penalty `lambda`. It would be convenient if I could have `cv.glmnet` automatically build up the confusion matrix of the best model, i.e. my code should look like:
```
initialize empty 2x2 matrices GLM_CONFUSION and GLMNET_CONFUSION

Train GLMNET on X using cv.glmnet
Set GLMNET_CONFUSION to be the confusion matrix of lambda.1se (or lambda.min)

# Cross-validation loop
For each data point VAL in my dataset X:
    Let TRAIN be the rest of X (not including VAL)

    Train GLM on TRAIN, use it to predict VAL
    Depending on if it was a true positive, false positive, etc.,
        add 1 to the correct entry in GLM_CONFUSION
```
Not only would it be convenient, it is somewhat of a necessity - there are two alternatives (both sketched in R below):

1. Use `cv.glmnet` to find a new `lambda.1se` on TRAIN at every iteration of the cross-validation loop (i.e. a nested cross-validation).
2. Use `cv.glmnet` to find `lambda.1se` on X, then 'fix' that value and treat it like a normal model to train during the cross-validation loop (two parallel cross-validations).
The second one is philosophically incorrect as it means GLMNET would have information on what it is trying to predict in the cross validation loop. The first would take a large chunk of time - I could in theory do it, but it might take half an hour and I feel as if there should be a better way.
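For reference, here is roughly what those two alternatives would look like in R. Again a sketch only: `x_mat`, `y_vec`, and the rest are hypothetical names, and I'm assuming glmnet's standard matrix interface for a binomial model.

```r
library(glmnet)
# x_mat: numeric predictor matrix; y_vec: binary 0/1 response (hypothetical)

# Alternative 1 (nested CV): re-tune lambda at every outer iteration - slow.
for (i in seq_len(nrow(x_mat))) {
  cvfit <- cv.glmnet(x_mat[-i, ], y_vec[-i], family = "binomial")
  pred  <- predict(cvfit, newx = x_mat[i, , drop = FALSE],
                   s = "lambda.1se", type = "class")
  # ... update GLMNET_CONFUSION from (pred, y_vec[i])
}

# Alternative 2 (fixed lambda): tune once on all of X, then reuse - leaky.
lam <- cv.glmnet(x_mat, y_vec, family = "binomial")$lambda.1se
for (i in seq_len(nrow(x_mat))) {
  fit  <- glmnet(x_mat[-i, ], y_vec[-i], family = "binomial")
  pred <- predict(fit, newx = x_mat[i, , drop = FALSE],
                  s = lam, type = "class")
  # ... update GLMNET_CONFUSION from (pred, y_vec[i])
}
```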
What I've Looked At So Far
I've looked at the documentation of `cv.glmnet` - it does not seem like you can do what I am asking, but I am very new to R and data science in general, so it is perfectly possible that I have missed something.
I have also looked on this website and seen some posts that at first glance appeared to be relevant but in fact ask for something different - for example, this post: tidy predictions and confusion matrix with glmnet. That post appears similar to what I want, but it is not quite what I am looking for - there they use `predict.cv.glmnet` to make new predictions and then build the confusion matrix from those, whereas I want the confusion matrix of the predictions made during the cross-validation step itself.
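To make the distinction concrete, the linked post is doing roughly this (my paraphrase, with the same hypothetical `x_mat`/`y_vec` names as above):

```r
# What the linked post does: predict with the final tuned model and
# tabulate. These are predictions from the finished model, NOT the
# fold-wise held-out predictions made inside cv.glmnet.
cvfit <- cv.glmnet(x_mat, y_vec, family = "binomial")
pred  <- predict(cvfit, newx = x_mat, s = "lambda.1se", type = "class")
table(predicted = pred, actual = y_vec)
```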
I'm hoping that someone is able to either:

1. Explain if and how it is possible to create the confusion matrix as described.
2. Show that there is a third alternative separate from the two I proposed.
   - "Hand-implement `cv.glmnet`" is not a viable alternative :P
3. Conclusively state that what I want is not possible and that I need to do one of the two alternatives I mentioned.
Any one of those would be a perfectly fine answer to this question (although I'm hoping for option 1!).
Apologies if there is something simple I have missed!