
I would like to fit a logistic regression with ridge regularization. Here is my code:

library(modeldata)
library(glmnet)

# check the data
data(attrition)
head(attrition)

# split the data into 80% training and 20% test
smp_size <- floor(0.8 * nrow(attrition))

## set the seed to make your partition reproducible
set.seed(123)

# randomly get the index for training data
train_ind <- sample(seq_len(nrow(attrition)), size = smp_size)

# get training and testing data
train <- attrition[train_ind, ]
test <- attrition[-train_ind, ]


# fit the model
X <- model.matrix(Attrition ~ ., train)
lm_ridge <- glmnet(X, train$Attrition, family = 'binomial', alpha = 0)


# get predicted values based on ridge regularization
prob_ridge <- predict(lm_ridge, model.matrix(Attrition ~ ., test), type = 'response')

prob_ridge is a 294 x 100 matrix, but I am expecting a single column, i.e. 294 x 1. Is anything wrong with my code? Why does the predict function return a matrix?

ycenycute
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Aug 07 '21 at 04:21
  • @MrFlick Thanks for the tips. I updated my codes. – ycenycute Aug 07 '21 at 06:07

1 Answer


glmnet fits the model over a whole sequence of lambda values, so you get a set of coefficients, and hence a prediction, for each lambda. As documented in the vignette:

If multiple values of s are supplied, a matrix of predictions is produced. If no value of s is supplied, a matrix of predictions is produced, with columns corresponding to all the lambdas used in the fit.
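Both branches of that rule are easy to verify on synthetic data (a minimal sketch, not the attrition set from the question):

```r
library(glmnet)

# toy binary-classification data
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- rbinom(100, 1, 0.5)

fit <- glmnet(x, y, family = "binomial", alpha = 0)

# no s supplied: one prediction column per lambda used in the fit
p_all <- predict(fit, x, type = "response")
ncol(p_all) == length(fit$lambda)  # TRUE

# several values of s supplied: one column per value
p3 <- predict(fit, x, type = "response", s = c(5, 1, 0.5))
dim(p3)  # 100 rows, 3 columns
```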

So in your case, your lambda values are:

head(lm_ridge$lambda,50)
 [1] 84.7169444 77.1909245 70.3334955 64.0852617 58.3921036 53.2047101
 [7] 48.4781503 44.1714850 40.2474120 36.6719429 33.4141086 30.4456913
[13] 27.7409800 25.2765478 23.0310489 20.9850340 19.1207814 17.4221439
[19] 15.8744087 14.4641699 13.1792130 12.0084080 10.9416141  9.9695913
[25]  9.0839203  8.2769298  7.5416302  6.8716526  6.2611939  5.7049667
[31]  5.1981532  4.7363636  4.3155981  3.9322122  3.5828853  3.2645917
[37]  2.9745744  2.7103214  2.4695439  2.2501564  2.0502587  1.8681194
[43]  1.7021608  1.5509455  1.4131638  1.2876222  1.1732334  1.0690066
[49]  0.9740390  0.8875081

If you supply a single value, e.g. s = 0.8875081, then you get 1 column:

pred = predict(lm_ridge, model.matrix(Attrition ~ ., test), type = 'response',
               s = 0.8875081)
dim(pred)
[1] 294   1

If you want to know the optimal lambda, you can follow the example in the vignette (mentioned above) and use a cross-validation approach with cv.glmnet, for example:

cvfit = cv.glmnet(X, train$Attrition, family = 'binomial', alpha = 0)
pred = predict(cvfit, model.matrix(Attrition ~ ., test), type = 'response')

dim(pred)
[1] 294   1

By default it chooses:

“lambda.1se”: the largest value of lambda at which the cross-validated error is within one standard error of the minimum (the default).
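If you prefer the lambda that minimizes the cross-validated error instead, pass s = "lambda.min" explicitly (again a sketch on synthetic data, since the question's train/test split isn't reproduced here):

```r
library(glmnet)

# toy binary-classification data
set.seed(1)
x <- matrix(rnorm(200 * 5), 200, 5)
y <- rbinom(200, 1, 0.5)

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0)

# s = "lambda.1se" is the default; "lambda.min" picks the error-minimizing fit
p_1se <- predict(cvfit, x, type = "response", s = "lambda.1se")
p_min <- predict(cvfit, x, type = "response", s = "lambda.min")
dim(p_min)  # 200 rows, 1 column: a single lambda yields a single column
```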

StupidWolf