R partykit: How do I use the offset?

Question

I am trying to predict the frequency of an outcome and I have a lot of data. I have already fitted a glm to the data and now I am trying to use ctree to understand any complex interaction in the dataset that I may have missed.

Instead of directly predicting the residual, I have tried to offset the ctree model to the glm prediction. However, I seem to get the same results when I: (a) use no offset at all, (b) specify the offset in the function, and (c) use the offset in the ctree equation.

I have tried looking at the documentation (here and here), but I have not found it helpful.

I have created some dummy data to mimic what I am doing:

library(partykit)

# Set random number seed
set.seed(15)

# Create Dataset
freq <- rpois(10000, 1.2)
example_df <- data.frame(var_1 = rnorm(10000, 180, 20) * freq / 10,
                        var_2 = runif(10000, 1, 8),
                        var_3 = runif(10000, 1, 2.5) + freq / 1000)
example_df$var_4 = example_df$var_1 * example_df$var_3 + rnorm(10000, 0.1, 0.5)
example_df$var_5 = example_df$var_2 * example_df$var_3 + rnorm(10000, 2, 50)

# Create GLM
base_mod <- glm(freq ~ ., family="poisson", data=example_df)
base_pred <- predict(base_mod)

# Create trees
exc_offset <- ctree(freq ~ ., data = example_df, control = ctree_control(alpha = 0.01, minbucket = 1000))
func_offset <- ctree(freq ~ ., data = example_df, offset = base_pred, control = ctree_control(alpha = 0.01, minbucket = 1000))
equ_offset <- ctree(freq ~ . + offset(base_pred), data = example_df, control = ctree_control(alpha = 0.01, minbucket = 1000))

I expected the trees fitted with an offset to differ from the tree fitted without one. However, the outputs seem to be the same:

# Predict outcomes
summary(predict(exc_offset, example_df))
summary(predict(func_offset, example_df))
summary(predict(equ_offset, example_df))

# Show trees
exc_offset
func_offset
equ_offset

Does anyone know what is going on? How should I use the offsets?

score 1 · Accepted Answer · answered Jul 18 '19 at 23:36

The ctree() algorithm is not based on a linear predictor and hence including an offset is not possible out-of-the-box. It is possible to include an offset by using a model-based ytrafo score, though. See vignette("ctree", package = "partykit") for more details (also available on CRAN at https://CRAN.R-project.org/web/packages/partykit/vignettes/ctree.pdf).

However, the more natural solution is to use a GLM model-based tree with the glmtree() function. I think you are trying to fit this tree:

glmtree(freq ~ ., data = example_df, offset = base_pred, family = poisson,
  alpha = 0.01, minsize = 1000)

See vignette("mob", package = "partykit") for more details (also available on CRAN at https://CRAN.R-project.org/web/packages/partykit/vignettes/mob.pdf).

But rather than estimating the offset once and then the tree once, it is also easily possible to iterate this process to obtain a better fit. We call this a PALM tree (partially additive linear model tree), available in the palmtree package (https://doi.org/10.1007/s11634-018-0342-1).

Finally, I would encourage you to explore which of the available covariates should be used as:

regressors in the offset (global regressors)
regressors in each node (local regressors)
splitting variables

The resulting model might be more interpretable when each covariate is assigned the right role.
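
The three roles listed above can be illustrated with a minimal sketch. It is not taken from the question's data: the data frame d and the variable names x_global, x_local, z and off are hypothetical, and the simulated effect sizes are arbitrary. The only partykit call used is glmtree(), with the same offset, family, alpha and minsize arguments as in the call above.

```r
library(partykit)

set.seed(15)
n <- 2000

# Hypothetical covariates, one per role (names are assumptions for illustration)
d <- data.frame(x_global = runif(n), x_local = runif(n), z = runif(n))

# Simulated counts: x_global acts the same everywhere,
# while the slope of x_local changes sign depending on z
d$y <- rpois(n, exp(0.2 + 0.5 * d$x_global +
                    ifelse(d$z > 0.5, 1, -1) * d$x_local))

# Global regressor: fit one GLM on all data and keep its linear predictor
glob <- glm(y ~ x_global, family = poisson, data = d)
d$off <- predict(glob, type = "link")

# Local regressor on the left of "|", splitting variable on the right;
# the global part enters through the offset
tr <- glmtree(y ~ x_local | z, data = d, offset = off, family = poisson,
              alpha = 0.01, minsize = 200)

print(tr)
coef(tr)  # node-specific coefficients for the intercept and x_local
```

Moving a covariate from one side of "|" to the other, or into the global GLM, changes its role without changing anything else in the call, which makes it fairly cheap to compare the alternatives.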

Thank you for your detailed response. Unfortunately, I don't think we can use this due to commercial restrictions. But this answer might be helpful to others. — CHF18, Jul 25 '19 at 10:57
What does this mean? You are at liberty to use `partykit::ctree` but not `partykit::glmtree`? An alternative solution could then be to model the residuals of the `glm` (instead of `freq`) using `ctree`. — Achim Zeileis, Jul 25 '19 at 13:06
Restrictions are frustrating; I have ended up modelling to the residuals. Thank you. — CHF18, Jul 30 '19 at 08:58
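
The residual-based alternative mentioned in the comments could look roughly like the sketch below. It uses simulated data with hypothetical names (d, x1, x2, res), and Pearson residuals are an assumption made here; the comments do not say which residual type was used.

```r
library(partykit)

set.seed(15)
n <- 2000

# Hypothetical data with an interaction that the main-effects GLM cannot capture
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- rpois(n, exp(0.1 + 0.6 * d$x1 + 0.8 * (d$x1 > 0.5 & d$x2 > 0.5)))

# Base GLM with main effects only
base_mod <- glm(y ~ x1 + x2, family = poisson, data = d)

# Pearson residuals: (observed - fitted) / sqrt(fitted) for a Poisson GLM
# (the choice of residual type is an assumption)
d$res <- residuals(base_mod, type = "pearson")

# Let ctree look for structure the GLM left in the residuals
res_tree <- ctree(res ~ x1 + x2, data = d,
                  control = ctree_control(alpha = 0.01, minbucket = 200))

print(res_tree)
```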
