1

I am struggling to understand how, in R, to generate predictive simulations for new data using a multilevel linear regression model with a single set of random intercepts. Following the example on pp. 146-147 of this text, I can execute this task for a simple linear model with no random effects. What I can't wrap my head around is how to extend the set-up to accommodate random intercepts for a factor added to that model.

I'll use iris and some fake data to show where I'm getting stuck. I'll start with a simple linear model:

mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)

Now let's use that model to generate 1,000 predictive simulations for 250 new cases. I'll start by making up those cases:

set.seed(20912)
fakeiris <- data.frame(Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
                       Sepal.Width = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
                       Species = sample(as.character(unique(iris$Species)), 250, replace = TRUE),
                       stringsAsFactors=FALSE)

Following the example in the aforementioned text, here's what I do to get 1,000 predictive simulations for each of those 250 new cases:

library(arm)
n.sims = 1000  # set number of simulations
n.tilde = nrow(fakeiris)  # set number of cases to simulate
X.tilde <- cbind(rep(1, n.tilde), fakeiris[,"Sepal.Width"])  # create matrix of predictors describing those cases; need column of 1s to multiply by intercept
sim.fakeiris <- sim(mod0, n.sims)  # draw the simulated coefficients
y.tilde <- array(NA, c(n.sims, n.tilde))  # build an array to hold results
for (s in 1:n.sims) { y.tilde[s,] <- rnorm(n.tilde, X.tilde %*% sim.fakeiris@coef[s,], sim.fakeiris@sigma[s]) }  # use matrix multiplication to fill that array

That works fine, and now we can do things like colMeans(y.tilde) to inspect the central tendencies of those simulations, and cor(colMeans(y.tilde), fakeiris$Sepal.Length) to compare them to the (fake) observed values of Sepal.Length.

Now let's try an extension of that simple model in which we assume that the intercept varies across groups of observations --- here, species. I'll use lmer() from the lme4 package to estimate a simple multilevel/hierarchical model that matches that description:

library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)

Okay, that works, but now what? I run:

sim.fakeiris.lmer <- sim(mod1, n.sims)

When I use str() to inspect the result, I see that it is an object of class sim.merMod with three components:

  • @fixedef, a 1,000 x 2 matrix with simulated coefficients for the fixed effects (the intercept and Sepal.Width)

  • @ranef, a 1,000 x 3 matrix with simulated coefficients for the random effects (the three species)

  • @sigma, a vector of length 1,000 containing the sigmas associated with each of those simulations

I can't wrap my head around how to extend the matrix construction and multiplication used for the simple linear model to this situation, which adds another dimension. I looked in the text, but I could only find an example (pp. 272-275) for a single case in a single group (here, species). The real-world task I'm aiming to perform involves running simulations like these for 256 new cases (pro football games) evenly distributed across 32 groups (home teams). I'd greatly appreciate any assistance you can offer.

Addendum. Stupidly, I hadn't looked at the details on simulate.merMod() in lme4 before posting this. I have now. It seems like it should do the trick, but when I run simulate(mod0, nsim = 1000, newdata = fakeiris), the result has only 150 rows. The values look sensible, but there are 250 rows (cases) in fakeiris. Where is that 150 coming from?

ulfelder
  • 5,305
  • 1
  • 22
  • 40

2 Answers2

4

One possibility is to use the predictInterval function from the merTools package. The package is about to be submitted to CRAN, but the current developmental release is available for download from GitHub,

    install.packages("devtools")
    devtools::install_github("jknowles/merTools")

To get the median and a 95% credible interval of 100 simulations:

    mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)

    out <- predictInterval(mod1, newdata=fakeiris, level=0.95,
                           n.sims=100, stat="median")

By default, predictInterval includes the residual variation, but you can turn that feature off with:

    out2 <- predictInterval(mod1, newdata=fakeiris, level=0.95,
                           n.sims=100, stat="median", 
                           include.resid.var=FALSE)

Hope this helps!

  • Thank you, that looks promising. Does it include an option for returning the results from all simulations, or are they at least embedded somewhere in the returned object? I want to examine the distribution of all of the predictive simulations for each case, not just the central tendency and credible intervals. – ulfelder Aug 11 '15 at 20:21
  • 1
    Currently it doesn't, but that functionality could easily be incorporated into the CRAN release as an option. Feel free to submit this as an issue (https://github.com/jknowles/merTools/issues) so that we don't lose track of it. – Carl Frederick Aug 11 '15 at 20:41
  • As an interim solution, I just took your GitHub script for `predictInterval()`, stopped it at the `yhat <- ....` step, grabbed the required helper functions from your helpers.R script, and ran that. It gave me a matrix of the proper size, and the correlation between the row means of those yhats and the results of `predict.merMod()` was ~1. Hooray! – ulfelder Aug 11 '15 at 21:25
  • 1
    FYI for anyone reading these comments: Carl and his merTools() package co-author have now added `returnSims` as an option in `predictInterval()`, so no need to hack a solution. – ulfelder Aug 12 '15 at 16:09
3

This might help: it doesn't use sim(), but instead uses mvrnorm() to draw the new coefficients from the sampling distribution of the fixed-effect parameters, uses a bit of internal machinery (setBeta0) to reassign the internal values of the fixed-effect coefficients. The internal values of the random effect coefficients are automatically resampled by simulate.merMod using the default argument re.form=NA. However, the residual variance is not resampled -- it is held fixed across the simulations, which isn't 100% realistic.

In your use case, you would specify newdata=fakeiris.

library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
simfun <- function(object,n=1,newdata=NULL,...) {
    v <- vcov(object)
    b <- fixef(object)
    betapars <- MASS::mvrnorm(n,mu=b,Sigma=v)
    npred <- if (is.null(newdata)) {
                 length(predict(object))
             } else nrow(newdata)
    res <- matrix(NA,npred,n)
    for (i in 1:n) {
        mod1@pp$setBeta0(betapars[i,])
        res[,i] <- simulate(mod1,newdata=newdata,...)[[1]]
    }
    return(res)
}
ss <- simfun(mod1,100)
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thanks so much. This function produces a matrix of the expected size, but if I compare the central tendencies of its results to point predictions from `predict()` --- as in, `cor(rowMeans(ss), predict(mod1, newdata = fakeiris))` --- I'm surprised to see a correlation of only ~0.5. Shouldn't the match be much tighter? – ulfelder Aug 11 '15 at 19:21
  • not sure why it's weird -- would have to look carefully to see what's going on (may just delete if I think it's wrong but can't figure out why) – Ben Bolker Aug 11 '15 at 21:41