1

I have a data set I am trying to fit with bam() in the mgcv package. The model has a binary outcome and I need to specify random intercepts for each animal ID. A subset of the data is below (my actual data is much larger with many more covariates):

dat2 <- read.csv('https://github.com/silasbergen/example_data/raw/main/dat2.csv')
dat2$Animal_id <- factor(dat2$Animal_id)
> head(dat2)
  Animal_id DEM_IA Anyrisk
1       105 279.94       0
2       105 278.68       0
3       106 329.13       0
4       106 329.93       0
5       106 332.25       0
6       106 333.52       0
> summary(dat2)
 Animal_id        DEM_IA         Anyrisk      
 105:     2   Min.   :156.3   Min.   :0.0000  
 106: 83252   1st Qu.:246.8   1st Qu.:0.0000  
 107: 22657   Median :290.1   Median :0.0000  
 108:104873   Mean   :284.8   Mean   :0.3619  
 109:142897   3rd Qu.:318.0   3rd Qu.:1.0000  
 110: 53967   Max.   :411.8   Max.   :1.0000 

I want to fit the model and predict for new data without the random effect:

library(mgcv)
mod <- bam(Anyrisk ~s(Animal_id,bs="re") + s(DEM_IA), data = dat2, family = "binomial",discrete=TRUE)
topred <-  data.frame(DEM_IA = c(280,320))
predict(mod, newdata = topred, exclude="s(Animal_id)",newdata.guaranteed = TRUE)

But this throws an error:

Error in eval(predvars, data, env) : object 'Animal_id' not found

Why would it need Animal_id when I am specifically telling it to exclude this term from the prediction? This is also especially strange as I can run the similar examples in the ?random.effects mgcv help file, no problem, even if I modify those examples to use bam() instead of gam()! Any help would be greatly appreciated!

EDIT

I may have found a fix; apparently if using discrete=TRUE in the bam() model, then predict.bam() also uses discrete=TRUE which will not work with missing random effect, but this works:

mod<- bam(Anyrisk ~s(Animal_id,bs="re") + s(DEM_IA), data = dat2, family = "binomial",discrete=TRUE)
topred <-  data.frame(DEM_IA = c(280,320))
predict(mod,topred, exclude="s(Animal_id)",newdata.guaranteed = TRUE,discrete=FALSE)

Output:

         1          2 
-0.4451066 -0.0285989 
Silas Bergen
  • 21
  • 1
  • 4
  • 1
    mgcv needs to build all the model terms by evaluating the bases before it can generate predictions. It would make the code more complex if at every evaluation of the bases you had to check whether you were including or excluding those terms etc, when all that is needed to ensure that is to zero out groups of columns from the Xp matrix immediately prior to computing the predicted values. Hence you always have to provide something for all terms used in the model, even though mgcv will later exclude the term; in your case just pass any one value of the possible levels for the ranef. – Gavin Simpson Dec 23 '20 at 15:02
  • 1
    I would really recommend you don't use `newdata.guaraneteed` unless you are absolutely certain you have created `newdata` correctly. While you can omit a term if it's not needed because of `exclude`, and Simon suggests this as one way to ignore ranefs, as you've found this doesn't work all the time with every model. Providing something for all terms in the model will work everywhere. – Gavin Simpson Dec 23 '20 at 15:04
  • I discovered that if `discrete=TRUE` is used in the `predict()` function, random effects are _not_ ignored even if `exclude = ` is specified. Note: `> topred <- data.frame(DEM_IA = c(280,280),Animal_id = dat2$Animal_id[c(1,3)]) > predict(mod,topred, exclude="s(Animal_id)",discrete=TRUE) 1 2 -0.8157566 -0.7150832 > predict(mod,topred, exclude="s(Animal_id)",discrete=FALSE) 1 2 -0.4451066 -0.4451066 ` – Silas Bergen Dec 23 '20 at 15:35
  • 1
    That was supposed to have been fixed in 1.8-32; check you are running version 1.8-33 of mgcv – Gavin Simpson Dec 23 '20 at 15:45
  • I'm using 1.8-31! I'll update; thanks. – Silas Bergen Dec 23 '20 at 17:35

1 Answers1

3

tl;dr work around this by putting something in for Animal_id, it doesn't matter what value you specify (not NA though ...)

Why? Can't say for sure without more digging in the code, but ... it's often convenient to use model.frame(formula, newdata) as a step toward computing the required model matrix. (For example, one could proceed by constructing the whole model matrix, then zeroing out columns to be ignored ...) Figuring out which terms can be deleted from the formula may be a separate, more difficult step. (I don't know why it works differently in bam and gam though ...)

This appears to work fine:

topred <-  data.frame(DEM_IA = c(280,320),
                      Animal_id=dat2$Animal_id[1])
predict(mod, newdata = topred, exclude="s(Animal_id)",newdata.guaranteed = TRUE)

Check that it really doesn't matter what you specify for Animal_id:

res <- lapply(levels(dat2$Animal_id),
           function(i) {
             dd <- transform(topred, Animal_id=i)
               predict(mod, newdata = dd, 
                       exclude="s(Animal_id)",newdata.guaranteed = TRUE)
           })
do.call(rbind,res)

Results:

              1          2
[1,] -0.4451066 -0.0285989
[2,] -0.4451066 -0.0285989
[3,] -0.4451066 -0.0285989
[4,] -0.4451066 -0.0285989
[5,] -0.4451066 -0.0285989
[6,] -0.4451066 -0.0285989
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Bizarre, I cannot replicate this result! `topred <- data.frame(DEM_IA = c(280,320), Animal_id=dat2$Animal_id[1]) topred2 <- data.frame(DEM_IA = c(280,320), Animal_id=dat2$Animal_id[3]) predict(mod, newdata = topred, exclude="s(Animal_id)",newdata.guaranteed = TRUE) predict(mod, newdata = topred2, exclude="s(Animal_id)",newdata.guaranteed = TRUE)` produce different predictions for me; so does your lapply() command. What version of mgcv are you using? – Silas Bergen Dec 23 '20 at 03:15
  • 1
    +1 Though you shouldn't need `newdata.gauranteed` and it's probably best not to use it as it bypasses sanity checks on the data. You need to provide something (other than `NA` etc) so that mgcv can form the relevant model components, while `exclude` and `terms` only affect which groups of columns get used in computing the predicted values, which happens later in the code. – Gavin Simpson Dec 23 '20 at 14:58