Optimizing an lmm aggregation function in R data.table

Question

I want to perform several lmer aggregations of all pairwise combinations of dependent and independent variables (plus some covariates). The number of deps is only a few, but the indeps are ~7000. Rows are participants, which in the example is only 1000, but in practice needs to scale to 10k-100k. This runs, but on the full model with more covariates it takes a fair amount of time (~2 hours w/ 8 threads) and I would like to shorten that up if possible.

library(data.table)
library(lmerTest) # loads lme4

# Functions -------------------------
get_linreg = function(dat, outcome, exposure){
  model = lmer(get(outcome) ~ get(exposure) + (1|IQ) + HrsTrn + HrsWrk,
               na.action = na.exclude, data=dat)
  # using Satterthwaite's approximation of significance
  a = summary(model)
  return(as.list(c(a$coefficients[2, -3], nobs=nobs(model))))
}

# Main -------------------------
# test data from (https://towardsdatascience.com/canonical-correlation-analysis-b1a38847219d)
dat = fread('https://articledatas3.s3.eu-central-1.amazonaws.com/CanonicalCorrelationAnalysisData.csv')

# make out table
outs = c(names(dat)[7:9]) # dependent vars
exps = c(names(dat)[1:3]) # independent vars
dt1 = CJ(dx=outs, ix=exps)

# fast run function
dt1[, c("beta", "se", "t stat", "pval", "nobs") := get_linreg(dat, dx, ix), by=dt1[,.I]]

I think I have it as fast as it will go. Several posts 1 2 have mentioned vectorizing the function for optimization, but both summary and nobs depend on model and cannot be extracted directly from the model. The lmer function is expecting a data table with the formula specified as col names, so I don't think I can melt it either 3.

A few things. (1) You can use `reformulate(c(exposure, "(1|IQ)", "HrsTrn", "HrsWrk"), response = outcome)` as your formula to avoid needing `get()` (fragile). (2) Looks like Satterthwaite dfs are very large, you might be able to save a tiny bit by using `ddf = lme4"`; (3) you can save a bit of time by using `refit()` when changing only the *response* variable (leaving everything else the same). I haven't bothered with benchmarking any of this ... also see `vignette("lmerperf", package = "lme4")` for performance tips — Ben Bolker, May 21 '23 at 23:16

Optimizing an lmm aggregation function in R data.table

0 Answers0