0

I'm trying to understand what actually happens when using exclude to exclude subject-random slope terms from the prediction function mgcv::predict.gam(). I'm finding that regardless of whether I exclude or include these, the predicted values are the same. Can someone please help me understand why?

Some other context: I previously asked a question about excluding random effects here. In my last question I was advised to use exclude to exclude certain random effects if using mgcv::predict.gam. Here, I also read that it is always advised to include the random effect covariate in the test/prediction data, and then exclude these in the prediction function. That post mentioned that the random effect covariate in the test data can contain arbitrary values, as predict.gam will discard these. However, I am finding that this advice does not apply to covariates that are modelled as random slope terms (i.e., the values in the random effect covariate affect the predicted values through the random slope terms).

Take this example.

# create training data
set.seed(123)
x <- seq(1,10,1)
resampled <- sample(x,100,replace=T)

dat <- tibble(
  Reference_fact = as.factor(resampled),
  Country_fact = as.factor(rep(stri_rand_strings(10,2),10)),
  exp_A = runif(100,0,500),
  exp_B = runif(100,0,20),
  resp = exp_A^2 + 0.1*exp_B * 0.3*exp_A*exp_B + rnorm(100, 0, 10)
)
# define model specification
model_spec <- c("resp ~ 
                 s(exp_A) +
                 s(exp_B) +
                 s(Reference_fact, bs = 're') + 
                 s(Country_fact, bs = 're') +
                 s(exp_A, Reference_fact,  bs = 're') +
                 s(exp_B, Reference_fact, bs = 're')"
)

fit <- mgcv::gam(formula(str_replace_all(model_spec, "[\r\n]", "")),
                 method = 'REML',
                 family = 'gaussian',
                 data=dat)

set.seed(123)
dat_pred <- tibble(
  Reference_fact = as.factor(rep(seq(1,10,1),5)),
  Country_fact = as.factor(rep(stri_rand_strings(10,2),5)), 
  exp_A = runif(50,0,500),
  exp_B = runif(50,0,20)
)

pr <- mgcv::predict.gam(
  fit, 
  dat_pred, 
  exclude = c("s(Reference_fact)"),
  include=c(
              "s(exp_A, Reference_fact)",
              "s(exp_B, Reference_fact)"
              ),
  keep_prediction_data = FALSE, 
  newdata.guaranteed = TRUE, 
  se.fit = FALSE) 

pr1 <- mgcv::predict.gam(
  fit, 
  dat_pred, 
  exclude = c("s(Reference_fact)",
    "s(exp_A, Reference_fact)", 
    "s(exp_B, Reference_fact)"
  ),
  keep_prediction_data = FALSE, 
  newdata.guaranteed = TRUE, 
  se.fit = FALSE) 

I expect the output of pr and pr1 to be different. However, they are the same.

> pr
          1           2           3           4           5 
200516.2393 121796.3625 104512.3974 247897.1752 108595.6892 
          6           7           8           9          10 
126261.5061  74434.1023  89602.9046  20708.9620   5000.3689 
         11          12          13          14          15 
232978.0968 205927.3882 120337.7975 160699.1463   -467.0361 
         16          17          18          19          20 
 57676.7601 147143.7946  13544.9029  27155.7492  13026.8152 
         21          22          23          24          25 
  4670.3186  44073.4473  42882.7365  34971.4159   5406.6882 
         26          27          28          29          30 
  4348.2832  14815.5571  54557.0164  17806.6748 185469.5308 
         31          32          33          34          35 
  1041.5902  49111.2511 160814.6226   5908.1544  79419.0126 
         36          37          38          39          40 
 12468.5101   5869.1082 143593.8867 201719.6693  34835.5460 
         41          42          43          44          45 
113740.5525   1948.7769  36794.7991  21055.7098 168001.1933 
         46          47          48          49          50 
 50446.2849 165451.7506 168251.5211 159329.6230  48757.5280

> pr1
          1           2           3           4           5 
200516.2393 121796.3625 104512.3974 247897.1752 108595.6892 
          6           7           8           9          10 
126261.5061  74434.1023  89602.9046  20708.9620   5000.3689 
         11          12          13          14          15 
232978.0968 205927.3882 120337.7975 160699.1463   -467.0361 
         16          17          18          19          20 
 57676.7601 147143.7946  13544.9029  27155.7492  13026.8152 
         21          22          23          24          25 
  4670.3186  44073.4473  42882.7365  34971.4159   5406.6882 
         26          27          28          29          30 
  4348.2832  14815.5571  54557.0164  17806.6748 185469.5308 
         31          32          33          34          35 
  1041.5902  49111.2511 160814.6226   5908.1544  79419.0126 
         36          37          38          39          40 
 12468.5101   5869.1082 143593.8867 201719.6693  34835.5460 
         41          42          43          44          45 
113740.5525   1948.7769  36794.7991  21055.7098 168001.1933 
         46          47          48          49          50 
 50446.2849 165451.7506 168251.5211 159329.6230  48757.5280  

Am I missing some very intuitive reason for why the random slope effects "s(exp_A, Reference_fact)", "s(exp_B, Reference_fact)" can/can't be excluded? At the moment, because pr and pr1 are the same, I don't know whether these are being excluded or included in the predict function. Any guidance would be very much appreciated.

Following the point above, if I set Reference_fact arbitrary to 0 in the prediction data, this gives different predicted values even if all random effects terms are excluded in the predict function. This, I could understand, because essentially there are no random slope coefficients being used, as Reference_fact=0 doesn't exist, therefore predicted values are just predicted using the population mean. Right?

dat_pred_arbitrary <- tibble(
  Reference_fact = 0,
  Country_fact = as.factor(rep(stri_rand_strings(10,2),5)), 
  exp_A = runif(50,0,500),
  exp_B = runif(50,0,20)
)

pa <- mgcv::predict.gam(
  fit, 
  dat_pred_arbitrary, 
  exclude = c("s(Reference_fact)",
    "s(exp_A, Reference_fact)",
    "s(exp_B, Reference_fact)"
  ),
  keep_prediction_data = FALSE, 
  newdata.guaranteed = TRUE, 
  se.fit = FALSE) 

> pa
         1          2          3          4          5 
200488.956 121788.106 104520.843 247897.029 108629.512 
         6          7          8          9         10 
126259.139  74472.436  89594.384  20708.678   4998.486 
        11         12         13         14         15 
232969.281 205918.624 120342.779 160626.310   -459.715 
        16         17         18         19         20 
 57672.117 147243.220  13532.432  27155.012  13023.850 
        21         22         23         24         25 
  4665.593  44064.876  42886.827  34910.752   5429.489 
        26         27         28         29         30 
  4346.282  14894.535  54555.745  17806.292 185460.863 
        31         32         33         34         35 
  1019.899  49106.885 160820.450   5820.028  79453.377 
        36         37         38         39         40 
 12459.032   5961.430 143585.435 201719.366  34833.051 
        41         42         43         44         45 
113706.734   1944.823  36795.537  20968.192 168052.472 
        46         47         48         49         50 
 50444.755 165507.188 168238.245 159329.166  48750.677

midily
  • 1
  • 2

1 Answers1

0

I've just realised that exclude is sensitive to spaces in the vector passed to it. So if I am excluding more than one random effect, it needs to appear exactly as it appears in the summary(fit), not as it appears in my original model specification. Otherwise, those effects are not excluded.

E.g.

pr1 <- mgcv::predict.gam(
  fit, 
  dat_pred, 
  exclude = c("s(Reference_fact)",
    "s(exp_A,Reference_fact)", 
    "s(exp_B,Reference_fact)"
  ),
  keep_prediction_data = FALSE, 
  newdata.guaranteed = TRUE, 
  se.fit = FALSE) 

> pr1
         1          2          3          4          5 
200488.956 121788.106 104520.843 247897.029 108629.512 
         6          7          8          9         10 
126259.139  74472.436  89594.384  20708.678   4998.486 
        11         12         13         14         15 
232969.281 205918.624 120342.779 160626.310   -459.715 
        16         17         18         19         20 
 57672.117 147243.220  13532.432  27155.012  13023.850 
        21         22         23         24         25 
  4665.593  44064.876  42886.827  34910.752   5429.489 
        26         27         28         29         30 
  4346.282  14894.535  54555.745  17806.292 185460.863 
        31         32         33         34         35 
  1019.899  49106.885 160820.450   5820.028  79453.377 
        36         37         38         39         40 
 12459.032   5961.430 143585.435 201719.366  34833.051 
        41         42         43         44         45 
113706.734   1944.823  36795.537  20968.192 168052.472 
        46         47         48         49         50 
 50444.755 165507.188 168238.245 159329.166  48750.677

Trivial, but mystery solved! Incidentally, this also confirms that excluding all random effects terms gives the same predicted values as passing a zero vector to the Reference_fact covariate in the test data (which prevents any random effects from being used; thus only using the population mean).

midily
  • 1
  • 2