40

I would like to extract the glmnet generated model coefficients and create a SQL query from them. The function coef(cv.glmnet.fit) yields a 'dgCMatrix' object. When I convert it to a matrix using as.matrix, the variable names are lost and only the coefficient values are left behind.

I know one can print the coefficients in the screen, however is it possible to write the names to a data frame?

Can anybody assist to extract these names?

LyzandeR
  • 37,047
  • 12
  • 77
  • 87
David Eborall
  • 401
  • 1
  • 4
  • 3
  • You need to post a reproducible example. Post some data, make an example tell us what the problem is and what you would like it to be. Using `glmnet` I am always getting variable names by default so I do not understand the question. – LyzandeR Jan 06 '15 at 16:14

9 Answers9

35

UPDATE: Both first two comments of my answer are right. I have kept the answer below the line just for posterity.

The following answer is short, it works and does not need any other package:

tmp_coeffs <- coef(cv.glmnet.fit, s = "lambda.min")
data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], coefficient = tmp_coeffs@x)

The reason for +1 is that the @i method indexes from 0 for the intercept but @Dimnames[[1]] starts at 1.


OLD ANSWER: (only kept for posterity) Try these lines:

The non zero coefficients:

coef(cv.glmnet.fit, s = "lambda.min")[which(coef(cv.glmnet.fit, s = "lambda.min") != 0)]

The features that are selected:

colnames(regression_data)[which(coef(cv.glmnet.fit, s = "lambda.min") != 0)]

Then putting them together as a dataframe is staight forward, but let me know if you want that part of the code also.


Mehrad Mahmoudian
  • 3,466
  • 32
  • 36
  • 4
    Note that `colnames(regression_data)[which(coef(cv.glmnet.fit, s = "lambda.min") != 0)]` does not take into account the intercept (the first column) and therefore shows wrong names – RUser4512 Sep 05 '16 at 16:09
  • 2
    `@x` object method will give you non-zero coefficients. – Davor Josipovic Dec 05 '16 at 09:52
  • thank you for your input. I now provided a better solution – Mehrad Mahmoudian Apr 19 '17 at 16:23
  • This is still incorrect. tmp_coeffs@i shows an index of 0 for the intercept whereas tmp_coeffs@Dimnames[[1]] has intercept at position 1. You need to add 1 or use Peter's solution below. – badmax Jul 26 '17 at 02:47
  • @Mehrad Mahmoudian: it would be great if you could remove the first two comments of your answer (e.g., suggesting using `coef(cv.glmnet.fit, s = "lambda.min")[which(coef(cv.glmnet.fit, s = "lambda.min") != 0)]` to get the coefficients). These give the wrong coefs/variable names unless the model has been fitted without an intercept and will lead users astray. – jruf003 Jan 05 '18 at 02:05
  • @jruf003 I see your point. What about I switch places of the old answer and the updated one? – Mehrad Mahmoudian Jan 05 '18 at 12:42
  • @MehradMahmoudian: Ultimately it's your answer, but personally I think it's misleading having the coef(cv.glmnet.fit, s = "lambda.min")[which(coef(cv.glmnet.fit, s = "lambda.min") != 0)]` code (etc) included in the answer as it gives an incorrect result – jruf003 Jan 07 '18 at 23:11
  • @jruf003 I guess you would approve the new arrangement. Thanks for your comment. I appreciate it. – Mehrad Mahmoudian Jan 08 '18 at 13:11
  • An annoying feature about this approach is that any logical predictors are renamed to add "TRUE" to their names. Similarly, names of factors are expanded to add their levels. This means that if one wants to build a formula for fitting a model using these, one has to clean the names, and deal with factor levels. – CoderGuy123 Jan 29 '18 at 08:07
  • `s = 'lambda.min` gives me an error. `Error in lambda[1] - s : non-numeric argument to binary operator` Any idea ? – joel.wilson Feb 19 '18 at 02:21
  • 1
    @joel.wilson Perhaps you have not used `glmnet::cv.glmnet()` for fitting the model. Is that the case? – Mehrad Mahmoudian May 30 '18 at 08:01
  • @Deleet The reason for adding these appendices to the feature names is show to which direction the coefficient affects the model. if you want to have only the feature names, you can convert all logical to binary and all factors to dummy features (for example by using `varhandle::to.dummy()`). These modifications does not change the fitting, these only does what glmnet does internally in plain sight and let you interpret easier the coefficients. – Mehrad Mahmoudian May 30 '18 at 08:05
9

Check broom package. It has tidy function that converts output of different R objects (including glmnet) into data.frames.

Tim
  • 7,075
  • 6
  • 29
  • 58
6

The names should be accessible as dimnames(coef(cv.glmnet.fit))[[1]], so the following should put both coefficient names and values into a data.frame: data.frame(coef.name = dimnames(coef(GLMNET))[[1]], coef.value = matrix(coef(GLMNET)))

Peter
  • 1,016
  • 9
  • 20
4

Building on Mehrad's solution above, here is a simple function to print a table containing only the non-zero coefficients:

print_glmnet_coefs <- function(cvfit, s="lambda.min") {
    ind <- which(coef(cvfit, s=s) != 0)
    df <- data.frame(
        feature=rownames(coef(cvfit, s=s))[ind],
        coeficient=coef(cvfit, s=s)[ind]
    )
    kable(df)
}

The function above uses the kable() function from knitr to produce a Markdown-ready table.

Keith Hughitt
  • 4,860
  • 5
  • 49
  • 54
  • 1
    `s = 'lambda.min` gives me an error. `Error in lambda[1] - s : non-numeric argument to binary operator` Any idea ? – joel.wilson Feb 19 '18 at 02:21
2

There is an approach with using coef() to glmnet() object (your model). In a case below index [[1]] indicate the number of outcome class in multinomial logistic regression, maybe for other models you shoould remove it.

coef_names_GLMnet <- coef(GLMnet, s = 0)[[1]]
row.names(coef_names_GLMnet)[coef_names_GLMnet@i+1]

row.names() indexes in such case needs incrementing (+1) because numeration of variables (data features) in coef() object begining from 0, but after transformation character vector numeration begining from 1.

Bem Ostap
  • 61
  • 4
2

Here, I wrote a reproducible example and fitted a binary (logistic) example using cv.glmnet. A glmnet model fit will also work. At the end of this example, I assembled non-zero coefficients, and associated features, into a data.frame called myResults:

library(glmnet)
X <- matrix(rnorm(100*10), 100, 10);
X[51:100, ] <- X[51:100, ] + 0.5; #artificially introduce difference in control cases
rownames(X) <- paste0("observation", 1:nrow(X));
colnames(X) <- paste0("feature",     1:ncol(X));

y <- factor( c(rep(1,50), rep(0,50)) ); #binary outcome class label
y
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [51] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1

## Perform logistic model fit:
fit1 <- cv.glmnet(X, y, family="binomial", nfolds=5, type.measure="auc"); #with K-fold cross validation
# fit1 <- glmnet(X, y, family="binomial") #without cross validation also works

## Adapted from @Mehrad Mahmoudian:
myCoefs <- coef(fit1, s="lambda.min");
myCoefs[which(myCoefs != 0 ) ]               #coefficients: intercept included
## [1]  1.4945869 -0.6907010 -0.7578129 -1.1451275 -0.7494350 -0.3418030 -0.8012926 -0.6597648 -0.5555719
## [10] -1.1269725 -0.4375461
myCoefs@Dimnames[[1]][which(myCoefs != 0 ) ] #feature names: intercept included
## [1] "(Intercept)" "feature1"    "feature2"    "feature3"    "feature4"    "feature5"    "feature6"   
## [8] "feature7"    "feature8"    "feature9"    "feature10"  

## Asseble into a data.frame
myResults <- data.frame(
  features = myCoefs@Dimnames[[1]][ which(myCoefs != 0 ) ], #intercept included
  coefs    = myCoefs              [ which(myCoefs != 0 ) ]  #intercept included
)
myResults
##       features      coefs
## 1  (Intercept)  1.4945869
## 2     feature1 -0.6907010
## 3     feature2 -0.7578129
## 4     feature3 -1.1451275
## 5     feature4 -0.7494350
## 6     feature5 -0.3418030
## 7     feature6 -0.8012926
## 8     feature7 -0.6597648
## 9     feature8 -0.5555719
## 10    feature9 -1.1269725
## 11   feature10 -0.4375461
David C.
  • 1,974
  • 2
  • 19
  • 29
  • `s = 'lambda.min` gives me an error. `Error in lambda[1] - s : non-numeric argument to binary operator` Any idea ? – joel.wilson Feb 19 '18 at 02:22
  • Did you use function `glmnet` or `cv.glmnet`? Their resulting data structures aren't the same. – David C. Feb 19 '18 at 02:25
  • "Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'which': 'list' object cannot be coerced to type 'double'" – Emmanuel Goldstein Jun 16 '21 at 17:11
1
# requires tibble.
tidy_coef <- function(x){
    coef(x) %>%
    matrix %>%   # Coerce from sparse matrix to regular matrix.
    data.frame %>%  # Then dataframes.
    rownames_to_column %>%  # Add rownames as explicit variables.
    setNames(c("term","estimate"))
}

Without tibble:

tidy_coef2 <- function(x){
    x <- coef(x)
    data.frame(term=rownames(x),
               estimate=matrix(x)[,1],
               stringsAsFactors = FALSE)
}
1

Assuming you know how to obtain your lambda, I found two different ways to show the predictors needed in the selected model for that particular lambda. One of them includes the intercept. The lambda can be obtained using cross-validation by the mean of cv.glmnet from "glmnet" library. You might want to only look at the last lines for each method:

 myFittedLasso = glmnet(x=myXmatrix, y=myYresponse, family="binomial")
 myCrossValidated = cv.glmnet(x=myXmatrix, y=myYresponse, family="binomial")
 myLambda = myCrossValidated$lambda.1se  # can be simply lambda

 # Method 1 without the intercept
 myBetas = myFittedLasso$beta[, which(myFittedLasso$lambda == myLambda)]
 myBetas[myBetas != 0]
 ## myPredictor1    myPredictor2    myPredictor3
 ##   0.24289802      0.07561533      0.18299284


 # Method 2 with the intercept
 myCoefficients = coef(myFittedLasso, s=myLambda)
 dimnames(myCoefficients)[[1]][which(myCoefficients != 0)]
 ## [1] "(Intercept)"    "myPredictor1"    "M_myPredictor2"    "myPredictor3"

 myCoefficients[which(myCoefficients != 0)]
 ## [1] -4.07805560  0.24289802  0.07561533  0.18299284

Note that the example above implies a binomial distribution but the steps can be applied to any other kind.

0

I faced a similar issue when using glmnet from the tidymodels framework, where the model was trained within a workflow and neither coef() nor the above solutions worked.

What worked for me though, was part of the glmnet:::coef.glmnet code:

# taken from glmnet:::coef.glmnet
coefs <- predict(x, "lambda.min", type = "coefficients", exact = FALSE)

dd <- cbind(
  data.frame(var = rownames(coefs)),
  as.data.table(as.matrix(coefs))
)
David
  • 9,216
  • 4
  • 45
  • 78