
I am trying to fit an additive mixed model using bam() from the mgcv package. My dataset has 10^6 observations from a longitudinal study on growth in 2*10^5 children nested in 300 health centers. I am looking for the slope for each center. The model is

bam(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + center + year + year*center + s(child, bs = "re"), data = data)

Whenever I try to fit the model, the following error message appears:

Error: cannot allocate vector of size 99.6 Gb
In addition: Warning message:
In matrix(by, n, q) : data length exceeds size of matrix

I am working on a cluster with 500 GB of RAM.

Thank you for any help

Adriana
  • Solutions to this are either very general (get more RAM) or very, very specific to your particular modeling task. e.g. see https://stackoverflow.com/q/10917532/324364 – joran Dec 27 '17 at 22:31

1 Answer


To diagnose more precisely where the problem is, try fitting your model with various terms left out. There are several terms in the model that could blow up on you:

  • the fixed effects involving center will blow up to 300 columns * 10^6 rows; depending on whether year is numeric or a factor, the year*center term could blow up to 600 columns or (nyears*300) columns
  • it's not clear to me whether bam uses sparse matrices for s(., bs = "re") terms; if not, you'll be in big trouble (2*10^5 columns * 10^6 rows); see the sketch just after this list for a quick way to check these column counts without fitting anything
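A minimal sketch of that check, assuming (as in the model above) your data frame is called data and contains the variables used in the formula: count the factor levels directly, or build the parametric design matrix on a small subset.

## rough column counts implied by the parametric / random terms
nlevels(factor(data$center))                               # ~ 300 columns for center
nlevels(factor(data$center)) * nlevels(factor(data$year))  # year*center if year is a factor
nlevels(factor(data$child))                                # ~ 2*10^5 columns if the child random
                                                           # effect is built as a dense matrix

## or build the parametric model matrix on a small random subset
subdat <- data[sample(nrow(data), 1e4), ]
ncol(model.matrix(~ center * year, data = subdat))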

As an order of magnitude, a vector of 10^6 numeric values (one column of your model matrix) takes about 7.6 Mb, so 500 GB / 7.6 Mb gives you room for approximately 65,000 dense columns ...
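That back-of-the-envelope calculation looks roughly like this (a sketch; the exact figures will depend on your data):

col_bytes <- as.numeric(object.size(numeric(1e6)))  # one dense numeric column: ~ 8e6 bytes (7.6 Mb)
ram_bytes <- 500 * 1024^3                           # 500 GB of RAM in bytes
floor(ram_bytes / col_bytes)                        # room for roughly 65,000-67,000 dense columns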

Just taking a guess here, but I would try out the gamm4 package. It's not specifically geared for low-memory use, but:

‘gamm4’ is most useful when the random effects are not i.i.d., or when there are large numbers of random coeffecients [sic] (more than several hundred), each applying to only a small proportion of the response data.

I would also make most of the terms into random effects (note that in gamm4 the random effects go lme4-style in the separate random argument):

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age),
             random = ~ (1 | center) + (1 | year) + (1 | year:center) + (1 | child),
             data = data)

or, if there are not very many years in the data set, treat year as a fixed effect:

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + year,
             random = ~ (1 | center) + (1 | year:center) + (1 | child),
             data = data)

If there are a small number of years then (year|center) might make sense, to assess among-center variation and covariation among years ... if there are many years, consider making it a smooth term instead ...
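For example, with around 10 years a random-slope version could look like this (a sketch only, assuming year is coded as a factor and, as above, the data frame is called data):

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age),
             random = ~ (year | center) + (1 | child),
             data = data)
## with many years, a smooth trend in year may be preferable, e.g. a global s(year)
## or, with year numeric, a per-center factor-smooth s(year, center, bs = "fs") in mgcv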

Ben Bolker
  • Thanks Ben. I will try it. The data come from 10 years. I didn't include the centers as random because I don't have a sample of centers but the population – Adriana Dec 27 '17 at 23:04
  • I don't think that's a barrier to treating centers as random. You can see my opinion on this topic [here](https://stats.stackexchange.com/questions/4700/what-is-the-difference-between-fixed-effect-random-effect-and-mixed-effect-mode/155056#155056) (and other opinions in the other answers) – Ben Bolker Dec 27 '17 at 23:41
  • +1 I'd also see if the original model can be fitted with the `discrete = TRUE` argument to `bam()`. Although, I don't think the issue is with the smooths here but rather, as Ben indicates, with the random effects, and I'm not sure how the discretization will help with memory usage for the `re` smooths. – Gavin Simpson Dec 28 '17 at 17:27