16

I'm trying to build a model with the glmnet package, but I get the following error when I run this line:

#library('glmnet')
x = model.matrix(response ~ ., data = acgh_frame[,c(3:ncol(acgh_frame))])

Error: protect(): protection stack overflow

I know this is due to the large number of variables (26k+) in my data frame; with fewer variables the error does not appear. I know how to solve this in command-line R, but I need to stay in RStudio, so I want to fix it from there. How do I do this?

Phil
Ansjovis86

3 Answers

11

@Ansjovis86

You can specify the protection stack size (ppsize) as a command-line argument to RStudio:

rstudio.exe --max-ppsize=5000000

You may also wish to set the expressions option, either in your .Rprofile or at runtime with options(expressions = 5e5).

> options(expressions = 5e5)
>?options

...

expressions:

sets a limit on the number of nested expressions that will be evaluated. Valid values are 25...500000 with default 5000. If you increase it, you may also want to start R with a larger protection stack; see --max-ppsize in Memory. Note too that you may cause a segfault from overflow of the C stack, and on OSes where it is possible you may want to increase that. Once the limit is reached an error is thrown. The current number under evaluation can be found by calling Cstack_info.

Use Cstack_info() to check the current settings.
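As a rough illustration of the runtime part, here is a minimal sketch that raises the expressions limit for the current session and inspects the C stack information. The variable names (`old`, `new_limit`, `info`) are just for illustration. Note that this only touches the expressions option; the pointer protection stack itself is fixed when R starts, which is why --max-ppsize has to be passed at launch.

```r
# Raise the nested-expression limit for this session and keep the old value.
old <- options(expressions = 5e5)
new_limit <- getOption("expressions")

# Cstack_info() returns a named integer vector:
# size, current, direction, eval_depth.
info <- Cstack_info()
print(new_limit)
print(info)

# Restore the previous setting when done experimenting.
options(old)
```

To make the change permanent, the same options() call can go in .Rprofile, as mentioned above.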
Technophobe01
  • Do you have any idea why I get this error when I run your code in my RStudio? `> rstudio.exe --max-ppsize=5000000` gives `Error in rstudio.exe - -max - ppsize = 5000000 : object 'rstudio.exe' not found` – vog Apr 25 '22 at 15:36
  • @vog I know this is an old thread, but was your directory set properly using cd? – Scott Hebert Apr 04 '23 at 18:42
  • I have not seen the "object not found" message before. On Windows, you can invoke `rstudio.exe` from the command line as follows: `"C:\Program Files\RStudio\bin\rstudio.exe" --max-ppsize=5000000`. The double quotes are important because of the space in `Program Files`. You may know this, but for completeness: you cannot invoke rstudio.exe without the full path unless its directory has been added to the PATH via system settings. Hope that helps, and apologies that I missed your question previously. – Technophobe01 Apr 04 '23 at 19:30
2

The root cause is the model.matrix function, which will 1) use a lot of memory; and 2) throw this error for a sufficiently large number of columns.

Try using my glmnetUtils package, which will get around both these problems. Rather than building the model matrix in one go, it does it term by term; and it also doesn't try to evaluate huge formulas. This is a lot faster, and doesn't risk blowing up the stack.

install.packages("glmnetUtils")
library(glmnetUtils)
glmnet(response ~ ., data = acgh_frame[3:ncol(acgh_frame)])
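Separately from glmnetUtils, and only as a sketch under the assumption that every predictor column is numeric (no factors to expand), the formula can be skipped entirely by converting the predictor columns straight to a matrix. `toy_frame` below is a hypothetical stand-in for `acgh_frame`, kept small so the comparison with model.matrix() is cheap.

```r
# Hypothetical stand-in for acgh_frame: a response plus 50 numeric predictors.
set.seed(1)
toy_frame <- data.frame(response = rnorm(20),
                        matrix(rnorm(20 * 50), nrow = 20))

# Build the predictor matrix directly, without evaluating a huge formula.
predictors <- setdiff(names(toy_frame), "response")
x <- as.matrix(toy_frame[, predictors])
y <- toy_frame$response

# On this small example, check that the values match what model.matrix()
# produces once its intercept column is dropped.
mm <- model.matrix(response ~ ., data = toy_frame)
same_values <- isTRUE(all.equal(unname(x), unname(mm[, -1])))
print(same_values)
```

`x` and `y` can then be passed to glmnet(x, y), which fits its own intercept by default. If the data frame contains factors, this shortcut does not apply, because the dummy-variable expansion done by model.matrix() would be lost.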
Hong Ooi
-2

Use PCR (principal component regression) or PLSR (partial least squares regression) to reduce the number of columns.
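For what it's worth, a bare-bones version of the PCR idea can be sketched in base R with prcomp() followed by lm() on the leading component scores. The data here is simulated and the choice of `k = 5` components is arbitrary; both are assumptions for illustration only, and in practice the number of components would be chosen by cross-validation.

```r
# Simulated data: more columns than rows, as in the question.
set.seed(1)
n <- 30
p <- 200
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

# Principal components of the centred and scaled predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep the first k component scores (k is an arbitrary illustrative choice).
k <- 5
scores <- pca$x[, 1:k]

# Regress the response on the reduced set of columns.
fit <- lm(y ~ scores)
print(dim(scores))
```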

thistleknot