6

I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which makes a distinction between zero-inflated gamma models and what they call "zero-altered" gamma models, which they describe as hurdle models that combine a binomial component for the zeros and a gamma component for the positive continuous outcome. I have been exploring the use of the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why not, as I know that the gamma distribution cannot take on the value of zero, so I suppose every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.

library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)

### DATA

id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1), 
          sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)

df <- data.frame(cbind(id, disease, age, gender, cost, time))

# create binary variable for non-zero costs

df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))

### HURDLE MODEL (MY VERSION)

# gamma component

hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)), 
                    data = subset(df, cost > 0),
                    family = Gamma(link = "log"))

model_parameters(hurdle_gamma, exponentiate = T)

# binomial component

hurdle_binomial <-  glm(cost_binary ~ disease + gender + age + time, 
                        data = df, family = "binomial")

model_parameters(hurdle_binomial, exponentiate = T)

# predicted probability of use

df$prob_use <- predict(hurdle_binomial, type = "response")

# predicted mean cost for people with any cost

df_bin <- subset(df, cost_binary == 1)

df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")

# combine data frames

df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")

# replace NA with 0

df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)

# calculate predicted cost for everyone

df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)

# mean predicted cost

mean(df2$cost_pred)

### glmmTMB with ziGamma

zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
                         family = ziGamma(link = "log"),
                         ziformula = ~ disease + gender + age + time,
                         data = df)

model_parameters(zigamma_model, exponentiate = T)

df <- df %>% predict(zigamma_model, new data = df, type = "response") # doesn't work
# "no applicable method for "predict" applied to an object of class "data.frame"

The coefficients from the gamma component of my hurdle model and the fixed effects components of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients on the zero-inflated model are different, and I also noticed that the z values in the binomial component are the negative inverse of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) and glmmTMB presumably modeling the probability of absence (0 is a success)?

In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?

Kellan Baker
  • 375
  • 3
  • 11
  • You could probably get this question reopened (if you want) by rephrasing as **solving a problem** rather than **requesting a package**. I think your question *should* be on-topic, but the consensus is that it isn't (see this [meta question and answers](https://meta.stackoverflow.com/questions/295845/might-is-there-a-package-rather-than-whats-the-best-package-questions-be-ok)) – Ben Bolker Jan 17 '21 at 17:45
  • I have edited it per your suggestion, which I hope will be sufficient for it to be re-opened. – Kellan Baker Jan 18 '21 at 01:34
  • Will work on this tomorrow. I think you're exactly right about the sign reversal (glmmTMB is predicting the probability of a zero rather than of a non-zero value). Can you say what kind of predictions you want to make (see `glmmTMB::predict.glmmTMB`) ? – Ben Bolker Jan 18 '21 at 02:05
  • I appreciated the discussion you linked to. I didn’t think my question was inappropriate, because I am not the most savvy navigator of the R universe and know there’s a lot out there that I don’t know about that others easily do - like your direction to the glmmTMB package. I had searched all over the internet for resources on gamma hurdle models and had come up with next to nothing, and posting here led me to your recommendation. I work hard to be thoughtful about my questions and feel like Stack Overflow often has an itchy trigger finger for abruptly closing questions without much feedback. – Kellan Baker Jan 18 '21 at 14:38
  • Sorry, your question wasn't inappropriate, it was just considered *off-topic*. As suggested in that discussion, I disagree with the consensus here, but feel bound to respect it. I think your question is probably re-openable now ... – Ben Bolker Jan 18 '21 at 14:45
  • Are your model and the version according to Zuur identical? If so, it might make this easier to read if you skip Zuur's ... – Ben Bolker Jan 18 '21 at 15:51
  • 1
    Yes, they are. Edited to remove Zuur's, which I had included just to bolster (my) confidence that I had done the hurdle model correctly, at least by their lights. Re: prediction, what I'm trying to do is to predict the cost for each person in my original data (pi * mu). I read the glmmTMB::predict overview but am not sure what I'm doing wrong – Kellan Baker Jan 18 '21 at 16:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/227500/discussion-between-kellan-baker-and-ben-bolker). – Kellan Baker Jan 18 '21 at 18:05

1 Answers1

6

The glmmTMB package can do this:

glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)

ought to do it. Maybe something in VGAM as well?


To answer the questions about coefficients and standard errors:

  • the change in sign of the binomial coefficients is exactly what you suspected (the difference between estimating the probability of 0 [glmmTMB] vs the probability of not-zero [your/Zuur's code])
  • The standard errors on the binomial part of the model are close but not identical: using broom.mixed::tidy,
round(1-abs(tidy(hurdle_g,component="zi")$statistic)/
      abs(tidy(hurdle_binomial)$statistic),3)
## [1] 0.057 0.001 0.000 0.000 0.295

6% for the intercept, up to 30% for the effect of age ...

  • the nearly twofold difference in the standard errors of the conditional (cost>0) component is definitely puzzling me; it holds up if we simply implement the Gamma/log-link in glmmTMB vs glm. It's hard to know how to check which is right/what the gold standard should be for this case. I might distrust Wald p-values in this case and try to get p-values with the likelihood ratio test instead (via drop1).

In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like Gamma); I wonder if that could be making things harder/worse?

More recent versions of glmmTMB can also use a Tweedie distribution for the conditional distribution of the response; this distribution allows for a combination of zeros and continuous positive values.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thanks, @Ben Bolker! I have used VGAM for non-hurdle zero-truncated negative binomial models, but unless I've missed something more recent, it does not accommodate gamma hurdle models. Is there any more detailed guidance on using glmmTMB with zigamma beyond https://cran.r-project.org/web/packages/glmmTMB/vignettes/glmmTMB.pdf? I'm not 100% following the example of fitting the hurdle model as opposed to a zero-inflated model with a gamma distribution. – Kellan Baker Jan 16 '21 at 03:12
  • Maybe you can clarify: as far as I know the "zero-inflated model with gamma distribution" is the same as the hurdle model. (You could check the results of your hurdle code against `glmmTMB` results ...) – Ben Bolker Jan 16 '21 at 17:56
  • I have typically heard hurdle models described as "two-part" or "zero-altered." "Zero-inflated" to me means a model that does not consider the zeros and the continuous or count outcomes to come from two different processes. – Kellan Baker Jan 17 '21 at 01:11
  • When I compared my hurdle model to glmmTMB with ziGamma, the coefficient for my predictor of interest is the same in the fixed-effects model from glmmTMB and the gamma component of my hurdle model, but the standard error is twice as large in my hurdle model (enough to change the statistical significance of the coefficient). The coefficients from the glmmTMB zero-inflated model and the binomial part of my hurdle model are very different: one gives an exponentiated coefficient <1, and the other gives a coefficient >1. I don't understand how to parse this to figure out which model to use. – Kellan Baker Jan 17 '21 at 01:31
  • terminology problem, perhaps. Technically, since the Gamma distribution has no probability of producing a zero outcome (not only is Prob(0)=0, which it is for any continuous distribution, but the probability **density** is also zero if the shape parameter is > 1 (and infinite if shape < 1 ...). Can you edit your question to show the results of both approaches? (The zero "inflation" parameter in glmmTMB is fitted on the logit scale, in case that helps ...) – Ben Bolker Jan 17 '21 at 17:49
  • I updated the code to make the outcome a gamma-distributed variable -- thank you for pointing that out. It does not fix the differences between the two approaches, unfortunately. – Kellan Baker Jan 18 '21 at 16:52