
I noticed that the minimized values of the MSE (mean of squared errors) and MAE (mean of absolute errors) criteria returned by "$cvm" or "plot()" in cv.glmnet are twice the actual values when the outcome is binary. This is very confusing; there must be some reason for it. Is this explained somewhere, or am I misunderstanding something?

# sample R code to illustrate the large discrepancy between the MSE and MAE
# values from cv.glmnet and the prediction errors
# calculated on a separate test data set

# generate binomial data
n <- 10000
set.seed(1234)
x1 <- runif(n, -2, 2); x2 <- runif(n, -2, 2)
p <- exp(x1 + x2)/(1 + exp(x1 + x2))
y <- rbinom(n, 1, p)
  # Note: if the line above is replaced by  y <- 5*p + rnorm(n)
  #  and family="gaussian" is used in cv.glmnet() below,
  #  then there is no similar discrepancy as with binary data

# predictors
x <- matrix(0, n, 10)
for (k in 1:ncol(x)) {
  a <- runif(1)
  x[, k] <- 0.5*(a*x1 + (1-a)*x2 + runif(n, -2, 2))
}

# training data
xa <- x[seq(n/2), ]; ya <- y[seq(n/2)]
# test data
xb <- x[-1*seq(n/2), ]; yb <- y[-1*seq(n/2)]

# install.packages("glmnet")  # uncomment if glmnet is not installed yet
library(glmnet)
# set alpha=1, i.e. apply lasso regression, with the optimal regularization
# parameter lambda chosen by the MSE or MAE criterion
cvfit_MSE <- cv.glmnet(xa, ya, family="binomial", alpha=1, 
                       type.measure="mse", nfolds=10)
cvfit_MAE <- cv.glmnet(xa, ya, family="binomial", alpha=1, 
                       type.measure="mae", nfolds=10)

min(cvfit_MSE$cvm) # MSE=0.3734 (mean of squared errors)
min(cvfit_MAE$cvm) # MAE=0.7477 (mean of absolute values of the errors)
plot(cvfit_MSE) # this plot also has its minimum at 0.3734 on the y-axis (labelled "MSE")
plot(cvfit_MAE) # here the y-axis (labelled "MAE") starts at about 0.75,
                # but even random guessing would produce an MAE of about 0.5,
                # which is extremely confusing... should the y-axis label be 2*MAE?
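  # sanity check: predicting 0.5 for every observation has absolute error 0.5
  # both when y=0 and when y=1, so "random guessing" gives MAE exactly 0.5
mean(abs(0.5 - ya)) # = 0.5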

# obtain predictions for the test data:
pp <- predict(cvfit_MSE, newx=xb, s="lambda.min", type="response")

# compare these to the actual observed values ("yb") in the test data.
# Surprisingly, the MSE and MAE from these predictions are about half of those from cv.glmnet:
mean((pp-yb)^2) # MSE=0.17650 (mean of squared errors)
mean(abs(pp-yb)) # MAE=0.36380 (mean of absolute values of the errors)
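
To double-check which scale cv.glmnet reports, here is a minimal sketch of the same 10-fold cross-validation done by hand with glmnet(); the random fold assignment and the reuse of lambda.min from cvfit_MSE are just illustrative choices on my part, not necessarily what cv.glmnet does internally:

# manual 10-fold CV: MSE/MAE computed directly from held-out predicted probabilities
set.seed(1234)
folds <- sample(rep(1:10, length.out = nrow(xa)))
lam <- cvfit_MSE$lambda.min
se <- ae <- numeric(nrow(xa))
for (k in 1:10) {
  fit_k <- glmnet(xa[folds != k, ], ya[folds != k], family="binomial", alpha=1)
  p_k <- predict(fit_k, newx=xa[folds == k, ], s=lam, type="response")
  se[folds == k] <- (ya[folds == k] - p_k)^2
  ae[folds == k] <- abs(ya[folds == k] - p_k)
}
mean(se) # held-out MSE: compare with min(cvfit_MSE$cvm) and with the test-set MSE above
mean(ae) # held-out MAE: compare with min(cvfit_MAE$cvm) and with the test-set MAE above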
Welcome to StackOverflow. Please include enough info for a minimal, reproducible example, for example the libraries you load. Please review this as well: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example And take the tour: https://stackoverflow.com/tour – John Polo Mar 01 '23 at 14:03

0 Answers