
I am trying to fit an additive mixed model using bam() from the mgcv package. My dataset has 10^6 observations from a longitudinal study on growth in 2*10^5 children nested in 300 health centers. I am looking for the slope for each center. The model is

bam(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + center + year + year*center + s(child, bs = "re"), data = data)

Whenever I try to fit the model, the following error message appears:

Error: cannot allocate vector of size 99.6 Gb
In addition: Warning message:
In matrix(by, n, q) : data length exceeds size of matrix

I am working on a cluster with 500 GB of RAM.

Thank you for any help

Adriana
  • Solutions to this are either very general (get more RAM) or very, very specific to your particular modeling task. e.g. see https://stackoverflow.com/q/10917532/324364 – joran Dec 27 '17 at 22:31

1 Answer


To diagnose more precisely where the problem is, try fitting your model with various terms left out. There are several terms in the model that could blow up on you:

  • the fixed effects involving center will blow up to 300 columns * 10^6 rows; depending on whether year is numeric or a factor, the year*center term could blow up to 600 columns or (nyears*300) columns
  • it's not clear to me whether bam uses sparse matrices for s(., bs = "re") terms; if not, you'll be in big trouble (2*10^5 columns * 10^6 rows); see the sketch just after this list for a quick way to check these column counts without fitting anything
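A minimal sketch of that check, assuming (as in the model above) your data frame is called data and contains the variables used in the formula: count the factor levels directly, or build the parametric design matrix on a small subset.

## rough column counts implied by the parametric / random terms
nlevels(factor(data$center))                               # ~ 300 columns for center
nlevels(factor(data$center)) * nlevels(factor(data$year))  # year*center if year is a factor
nlevels(factor(data$child))                                # ~ 2*10^5 columns if the child random
                                                           # effect is built as a dense matrix

## or build the parametric model matrix on a small random subset
subdat <- data[sample(nrow(data), 1e4), ]
ncol(model.matrix(~ center * year, data = subdat))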

As an order of magnitude, a vector of 10^6 numeric values (one column of your model matrix) takes about 7.6 Mb, so 500 GB / 7.6 Mb gives you room for approximately 65,000 dense columns ...
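That back-of-the-envelope calculation looks roughly like this (a sketch; the exact figures will depend on your data):

col_bytes <- as.numeric(object.size(numeric(1e6)))  # one dense numeric column: ~ 8e6 bytes (7.6 Mb)
ram_bytes <- 500 * 1024^3                           # 500 GB of RAM in bytes
floor(ram_bytes / col_bytes)                        # room for roughly 65,000-67,000 dense columns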

Just taking a guess here, but I would try out the gamm4 package. It's not specifically geared for low-memory use, but:

‘gamm4’ is most useful when the random effects are not i.i.d., or when there are large numbers of random coeffecients [sic] (more than several hundred), each applying to only a small proportion of the response data.

I would also make most of the terms into random effects (note that in gamm4 the random effects go lme4-style in the separate random argument):

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age),
             random = ~ (1 | center) + (1 | year) + (1 | year:center) + (1 | child),
             data = data)

or, if there are not very many years in the data set, treat year as a fixed effect:

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + year,
             random = ~ (1 | center) + (1 | year:center) + (1 | child),
             data = data)

If there are a small number of years then (year|center) might make sense, to assess among-center variation and covariation among years ... if there are many years, consider making it a smooth term instead ...
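For example, with around 10 years a random-slope version could look like this (a sketch only, assuming year is coded as a factor and, as above, the data frame is called data):

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age),
             random = ~ (year | center) + (1 | child),
             data = data)
## with many years, a smooth trend in year may be preferable, e.g. a global s(year)
## or, with year numeric, a per-center factor-smooth s(year, center, bs = "fs") in mgcv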

Ben Bolker
  • Thanks Ben. I will try it. The data come from 10 years. I didn't include the centers as random because I don't have a sample of centers but the population – Adriana Dec 27 '17 at 23:04
  • I don't think that's a barrier to treating centers as random. You can see my opinion on this topic [here](https://stats.stackexchange.com/questions/4700/what-is-the-difference-between-fixed-effect-random-effect-and-mixed-effect-mode/155056#155056) (and other opinions in the other answers) – Ben Bolker Dec 27 '17 at 23:41
  • +1 I'd also see if the original model can be fitted with the `discrete = TRUE` argument to `bam()`. Although, I don't think the issue is with the smooths here but rather, as Ben indicates, with the random effects, and I'm not sure how the discretization will help with memory usage for the `re` smooths. – Gavin Simpson Dec 28 '17 at 17:27