I'm trying to understand what actually happens when using exclude
to exclude subject-random slope terms from the prediction function mgcv::predict.gam()
. I'm finding that regardless of whether I exclude
or include
these, the predicted values are the same. Can someone please help me understand why?
Some other context: I previously asked a question about excluding random effects here. In my last question I was advised to use exclude
to exclude certain random effects if using mgcv::predict.gam
. Here, I also read that it is always advised to include the random effect covariate in the test/prediction data, and then exclude these in the prediction function. That post mentioned that the random effect covariate in the test data can contain arbitrary values, as predict.gam
will discard these. However, I am finding that this advice does not apply to covariates that are modelled as random slope terms (i.e., the values in the random effect covariate affect the predicted values through the random slope terms).
Take this example.
# create training data
set.seed(123)
x <- seq(1,10,1)
resampled <- sample(x,100,replace=T)
dat <- tibble(
Reference_fact = as.factor(resampled),
Country_fact = as.factor(rep(stri_rand_strings(10,2),10)),
exp_A = runif(100,0,500),
exp_B = runif(100,0,20),
resp = exp_A^2 + 0.1*exp_B * 0.3*exp_A*exp_B + rnorm(100, 0, 10)
)
# define model specification
model_spec <- c("resp ~
s(exp_A) +
s(exp_B) +
s(Reference_fact, bs = 're') +
s(Country_fact, bs = 're') +
s(exp_A, Reference_fact, bs = 're') +
s(exp_B, Reference_fact, bs = 're')"
)
fit <- mgcv::gam(formula(str_replace_all(model_spec, "[\r\n]", "")),
method = 'REML',
family = 'gaussian',
data=dat)
set.seed(123)
dat_pred <- tibble(
Reference_fact = as.factor(rep(seq(1,10,1),5)),
Country_fact = as.factor(rep(stri_rand_strings(10,2),5)),
exp_A = runif(50,0,500),
exp_B = runif(50,0,20)
)
pr <- mgcv::predict.gam(
fit,
dat_pred,
exclude = c("s(Reference_fact)"),
include=c(
"s(exp_A, Reference_fact)",
"s(exp_B, Reference_fact)"
),
keep_prediction_data = FALSE,
newdata.guaranteed = TRUE,
se.fit = FALSE)
pr1 <- mgcv::predict.gam(
fit,
dat_pred,
exclude = c("s(Reference_fact)",
"s(exp_A, Reference_fact)",
"s(exp_B, Reference_fact)"
),
keep_prediction_data = FALSE,
newdata.guaranteed = TRUE,
se.fit = FALSE)
I expect the output of pr
and pr1
to be different. However, they are the same.
> pr
1 2 3 4 5
200516.2393 121796.3625 104512.3974 247897.1752 108595.6892
6 7 8 9 10
126261.5061 74434.1023 89602.9046 20708.9620 5000.3689
11 12 13 14 15
232978.0968 205927.3882 120337.7975 160699.1463 -467.0361
16 17 18 19 20
57676.7601 147143.7946 13544.9029 27155.7492 13026.8152
21 22 23 24 25
4670.3186 44073.4473 42882.7365 34971.4159 5406.6882
26 27 28 29 30
4348.2832 14815.5571 54557.0164 17806.6748 185469.5308
31 32 33 34 35
1041.5902 49111.2511 160814.6226 5908.1544 79419.0126
36 37 38 39 40
12468.5101 5869.1082 143593.8867 201719.6693 34835.5460
41 42 43 44 45
113740.5525 1948.7769 36794.7991 21055.7098 168001.1933
46 47 48 49 50
50446.2849 165451.7506 168251.5211 159329.6230 48757.5280
> pr1
1 2 3 4 5
200516.2393 121796.3625 104512.3974 247897.1752 108595.6892
6 7 8 9 10
126261.5061 74434.1023 89602.9046 20708.9620 5000.3689
11 12 13 14 15
232978.0968 205927.3882 120337.7975 160699.1463 -467.0361
16 17 18 19 20
57676.7601 147143.7946 13544.9029 27155.7492 13026.8152
21 22 23 24 25
4670.3186 44073.4473 42882.7365 34971.4159 5406.6882
26 27 28 29 30
4348.2832 14815.5571 54557.0164 17806.6748 185469.5308
31 32 33 34 35
1041.5902 49111.2511 160814.6226 5908.1544 79419.0126
36 37 38 39 40
12468.5101 5869.1082 143593.8867 201719.6693 34835.5460
41 42 43 44 45
113740.5525 1948.7769 36794.7991 21055.7098 168001.1933
46 47 48 49 50
50446.2849 165451.7506 168251.5211 159329.6230 48757.5280
Am I missing some very intuitive reason for why the random slope effects "s(exp_A, Reference_fact)", "s(exp_B, Reference_fact)"
can/can't be excluded? At the moment, because pr
and pr1
are the same, I don't know whether these are being excluded or included in the predict function. Any guidance would be very much appreciated.
Following the point above, if I set Reference_fact
arbitrary to 0
in the prediction data, this gives different predicted values even if all random effects terms are excluded in the predict function. This, I could understand, because essentially there are no random slope coefficients being used, as Reference_fact=0
doesn't exist, therefore predicted values are just predicted using the population mean. Right?
dat_pred_arbitrary <- tibble(
Reference_fact = 0,
Country_fact = as.factor(rep(stri_rand_strings(10,2),5)),
exp_A = runif(50,0,500),
exp_B = runif(50,0,20)
)
pa <- mgcv::predict.gam(
fit,
dat_pred_arbitrary,
exclude = c("s(Reference_fact)",
"s(exp_A, Reference_fact)",
"s(exp_B, Reference_fact)"
),
keep_prediction_data = FALSE,
newdata.guaranteed = TRUE,
se.fit = FALSE)
> pa
1 2 3 4 5
200488.956 121788.106 104520.843 247897.029 108629.512
6 7 8 9 10
126259.139 74472.436 89594.384 20708.678 4998.486
11 12 13 14 15
232969.281 205918.624 120342.779 160626.310 -459.715
16 17 18 19 20
57672.117 147243.220 13532.432 27155.012 13023.850
21 22 23 24 25
4665.593 44064.876 42886.827 34910.752 5429.489
26 27 28 29 30
4346.282 14894.535 54555.745 17806.292 185460.863
31 32 33 34 35
1019.899 49106.885 160820.450 5820.028 79453.377
36 37 38 39 40
12459.032 5961.430 143585.435 201719.366 34833.051
41 42 43 44 45
113706.734 1944.823 36795.537 20968.192 168052.472
46 47 48 49 50
50444.755 165507.188 168238.245 159329.166 48750.677