3

I'm working on a hurdle model and ran into a question I can't quite figure out. It was my understanding that the overall response prediction of the hurdle is the multiplication of the count prediction by the probability prediction. I.e., the overall response has to be smaller or equal to the count prediction. However, in my data, the response prediction is higher than the count prediction, and I can't figure out why.

Here's a similar result for a toy model (code adapted from here):

library("pscl") 
data("RecreationDemand", package = "AER") 

## model 
m <- hurdle(trips ~ quality | ski, data = RecreationDemand, dist = "negbin") 
nd <- data.frame(quality = 0:5, ski = "no")
predict(m, newdata = nd, type = "count")
predict(m, newdata = nd, type = "response")

Why is it that the counts are higher than the responses?

added comparison to glm.nb

Also - I was under the impression that the count part of the hurdle model should give identical predictions to a count-model of only positive values. When I try that, I get completely different values. What am I missing??

library(MASS)
m.nb <- glm.nb(trips ~ quality, data = RecreationDemand[RecreationDemand$trips > 0,]) 
predict(m, newdata = nd, type = "count") ## hurdle
predict(m.nb, newdata = nd, type = "response") ## positive counts only
user2602640
  • 640
  • 6
  • 21

1 Answers1

1

The last question is the easiest to answer. The "count" part of the hurdle modle is not simply a standard count model (including a positive probability for zeros) but a zero-truncated count model (where zeros cannot occur).

Using the countreg package from R-Forge you can fit the model you attempted to fit with glm.nb in your example. (Alternatively, VGAM or gamlss could also be used to fit the same model.)

library("countreg")
m.truncnb <- zerotrunc(trips ~ quality, data = RecreationDemand,
  subset = trips > 0, dist = "negbin")
cbind(hurdle = coef(m, model = "count"), zerotrunc = coef(m.truncnb), negbin = coef(m.nb))
##                 hurdle  zerotrunc     negbin
## (Intercept) 0.08676189 0.08674119 1.75391028
## quality     0.02482553 0.02483015 0.01671314

Up to small numerical differences the first two models are exactly equivalent. The non-truncated model, however, has to compensate the lack of zeros by increasing the intercept and dampening the slope parameter, which is clearly not appropriate here.

As for the predictions, one can distinguish three quantities:

  1. The expectation from the untruncated count part, i.e., simply exp(x'b).
  2. The conditional/truncated expectation from the count part, i.e., accounting for the zero trunctation: exp(x'b)/(1 - f(0)) where f(0) is the probability for 0 in that count part.
  3. The overall expectation for the complete hurdle model, i.e., the probability for crossing the hurdle times the conditional expectation from 2.: exp(x'b)/(1 - f(0)) * (1 - g(0)) where g(0) is the probability for 0 in the zero hurdle part of the model.

See also Section 2.2 and Appendix C in vignette("countreg", package = "pscl") for more details and formulas. predict(..., type = "count") computes item 1 from above where predict(..., type = "response") computes item 3 for a hurdle model and item 2 for a zerotrunc model.

Achim Zeileis
  • 15,710
  • 1
  • 39
  • 49
  • Hey @AchimZeileis, bringing this thread back because I'm having a similar issue. The expression in (3) you posted is giving me the same value as using `predict(model, type = 'count')` and also as expression (1), while `predict(model, type = 'response')` is returning a far larger (2.5x) value. I'm not sure that follows my intuition for what should be happening. What am I missing? – Matt Kaye Apr 06 '21 at 15:11
  • Hard to say without a concrete reproducible example... – Achim Zeileis Apr 06 '21 at 17:00
  • I'll try to come up with a `reprex`. Give me a few on that – Matt Kaye Apr 06 '21 at 17:01
  • link to question here: https://stackoverflow.com/questions/66973477/pscl-hurdle-model-count-prediction-response-prediction – Matt Kaye Apr 06 '21 at 17:22
  • Thanks for the reprex. I'm not sure though whether a separate question is needed, especially as we need to refer to the items 1./2./3. above. In your code you correctly computed exp(x'b) and g(0). But to compute 3. you additionally need f(0) and then apply the equation given above. – Achim Zeileis Apr 06 '21 at 23:10
  • Thanks! And the only reason for the additional question was formatting of the reprex. AFAIK, the formatting in the comments is doomed to be terrible. Do you mind explaining what I'm misunderstanding about the f(0)? My intuition says the expected count conditioned on clearing the hurdle should be the overall expectation divided by the probability of clearing the hurdle, but clearly that isn't the correct way to think about it. Is there an intuitive explanation for why the `pnbinom(...)` needs to be included? What's that adding? Many thanks again, I really appreciate the time and the help! – Matt Kaye Apr 07 '21 at 00:20
  • You are right that the overall expectation is the product E(Y) = E(Y | Y > 0) * P(Y > 0). And P(Y > 0) is 1 - g(0) in the notation from my post. What you got wrong is that you think E(Y | Y > 0) is exp(x'b). This is not the case. We need the truncated/conditional expectation and exp(x'b) gives the expectation of the untruncated count variable. By dividing this with 1 - f(0) we get the truncated/conditional expectation we need. – Achim Zeileis Apr 08 '21 at 09:48
  • I see. Thanks again for clarifying! – Matt Kaye Apr 08 '21 at 12:59