1

I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.

> # display input data to the model
> head(inputData)
             id     y     x   z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
       id         y         x         z 
 "factor" "numeric" "numeric"  "factor" 
> 
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
> 
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
   Data: inputData
REML criterion at convergence: 444.245
Random effects:
 Groups   Name        Std.Dev.
 z        (Intercept) 0.3097  
 Residual             0.9682  
Number of obs: 157, groups:  z, 17
Fixed Effects:
(Intercept)            x  
    -0.4291       5.5638  
> 
> # display new data to predict in
> head(predData)
           id     x   z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
       id         x         z 
 "factor" "numeric"  "factor" 
> 
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
> 
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
> 
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
> 
> # display results
> head(preds_mid) # correct prediction
       1        2        3        4        5        6 
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470 
> head(preds_interval) # incorrect order
       fit      upr          lwr
1 1.512410 2.694813  0.133571198
2 1.273143 2.521899  0.009878347
3 1.398273 2.785358  0.232501376
4 1.878165 3.188086  0.625161201
5 1.605049 2.813737  0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
       fit      upr         lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
          fit      upr           lwr
1033 1.244261 2.457012 -0.0001299777
>

To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.

The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variable as character, enforcing the order of the rows prior to fitting the model, between fitting the model and predicting, but found no success.

Is this a known issue? What can I do to avoid it?

dmp
  • 815
  • 1
  • 6
  • 19
  • I don't have an answer here, but I'm one of the developers of `merTools` and this seems like a pretty big problem. My guess is that this may be occurring when we try to preserve factor levels that are observed in the model but unobserved in the prediction data. I will see if I can reproduce this. Can you confirm you are using the latest version of merTools from CRAN? – jknowles Mar 30 '16 at 15:09
  • Thanks for your reply. According to `sessionInfo()`, I'm using `merTools_0.2.0`. I am, however running R x64 3.2.2 and `merTools` notes that it was built for 3.2.4. One thought that I had is that the `id` variable in my `predData` object is a bunch of numbers stored as characters (not my idea). Could `predictInterval()` be converting it to numeric? – dmp Mar 30 '16 at 17:15
  • It's a great thing to point out -- I can test that case easily first. – jknowles Mar 30 '16 at 19:18

1 Answers1

1

I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.

library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
              id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
              newdata=d,
              newparams=list(beta=c(2, -1.1), theta=c(.25),
                             sigma = c(.23)), seed =463)[[1]]
 lmeFit <- lmer(y ~ x + (1|z), data = d)
 predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
              id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])

But, playing around with this code I was not able to recreate the error observed above. Can you post a synthetic example or see if you can create a synthetic example?

Or can you test the issue first coercing the factors to characters and seeing if you see the same re-ordering issue?

jknowles
  • 479
  • 3
  • 13