
I'm fitting a model in R using the quasipoisson family like this:

model <- glm(y ~ 0 + log_a + log_b + log_c + log_d + log_gm_a + 
   log_gm_b + log_gm_c + log_gm_d, family = quasipoisson(link = 'log'))

glm finds values for the first five coefficients. It says the others are NA. Interestingly, if I reorder the variables in the formula, glm always finds coefficients for the five variables that appear first in the formula.

There is sufficient data (the number of rows is many times the number of parameters).

How should I interpret those NA coefficients?

The author of the model I'm implementing insists that the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model. I suspect something else is going on.

Ben Bolker
Bob

1 Answer


My guess is that the author (who says "the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model") is wrong (although it's hard to be 100% sure without having the full context).

The problem is almost certainly that you have multicollinear predictors. The reason that different variables get dropped (i.e. have NA coefficients returned) when you reorder the formula is that R uses the order of the terms in the formula to decide which redundant columns to drop. As far as the fitted model goes, the choice doesn't matter: all of the top-level results (predictions, goodness of fit, etc.) are identical whichever columns are dropped.
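To see the order dependence concretely, here is a small simulated sketch (the variable names are made up, not the OP's data): a predictor that is an exact linear combination of two others gets an NA coefficient, which predictor is dropped depends only on formula order, and the fitted values are unchanged either way.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2              # exactly collinear with x1 and x2
y  <- rpois(n, exp(0.2 * x1 + 0.1 * x2))

m1 <- glm(y ~ x1 + x2 + x3, family = quasipoisson(link = "log"))
m2 <- glm(y ~ x3 + x1 + x2, family = quasipoisson(link = "log"))

coef(m1)                           # x3 is NA (it comes last in the formula)
coef(m2)                           # now x2 is NA instead
all.equal(fitted(m1), fitted(m2))  # TRUE: the fitted models are identical
```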

In comments the OP says:

The relationship between log_a and log_gm_a is that this is a multiplicative fixed-effects model. So log_a is the log of predictor a. log_gm_a is the log of the geometric mean of a. So each of the log_gm terms is constant across all observations.

This is the key information needed to diagnose the problem. Because the intercept is excluded from this model (the formula contains `0 +`), having one constant column in the model matrix is OK, but multiple constant columns are trouble; all but the first (in whatever order the formula specifies) will be discarded. To go slightly deeper: the model requested is

Y = b1*C1 + b2*C2 + b3*C3 + [additional terms]

where C1, C2, C3 are constants. At the point in "data space" where the additional terms are 0 (i.e. for cases where log_a = log_b = log_c = ... = 0), we're left with predicting a single constant value from three separate constant terms. Suppose that the intercept in a regular model (~ 1 + log_a + log_b + log_c) would have been m. Then any combination of (b1, b2, b3) that makes b1*C1 + b2*C2 + b3*C3 equal to m (and there are infinitely many such combinations) will fit the data equally well.
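A minimal reproduction of the OP's situation (simulated data; the names mirror the question but are illustrative): two constant columns in an intercept-free model, of which only the first gets an estimated coefficient.

```r
set.seed(42)
n        <- 50
log_a    <- rnorm(n)
log_gm_a <- rep(log(2.5), n)   # constant across all observations
log_gm_b <- rep(log(1.7), n)   # a second constant column
y        <- rpois(n, exp(0.5 * log_a))

m <- glm(y ~ 0 + log_a + log_gm_a + log_gm_b,
         family = quasipoisson(link = "log"))
coef(m)  # log_gm_b is NA: with no intercept, only one constant column survives
```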

I still don't know much about the context, but it might be worth folding the constant terms into the model as an offset. Alternatively, you could scale the predictors by their geometric means, i.e. subtract the log geometric means from the log predictors.
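Both suggestions can be sketched as follows (simulated data again; this is an illustration under assumed names, not the OP's actual model):

```r
set.seed(7)
n        <- 50
log_a    <- rnorm(n)
log_b    <- rnorm(n)
log_gm_a <- rep(mean(log_a), n)  # log geometric mean of a (constant)
log_gm_b <- rep(mean(log_b), n)
y        <- rpois(n, exp(0.3 * log_a - 0.2 * log_b))

# Option 1: fold the constants into an offset (coefficient fixed at 1,
# nothing to estimate, so no aliasing):
m_off <- glm(y ~ 0 + log_a + log_b,
             offset = log_gm_a + log_gm_b,
             family  = quasipoisson(link = "log"))

# Option 2: center each predictor at its log geometric mean and keep a
# single intercept to absorb the constant part:
m_ctr <- glm(y ~ 1 + I(log_a - log_gm_a) + I(log_b - log_gm_b),
             family = quasipoisson(link = "log"))
```

Either way the model matrix has full rank, so no coefficients come back NA.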


In other cases, multicollinearity arises from unidentifiable interaction terms; nested variables; attempts to include all the levels of multiple categorical variables; or including the proportions of all levels of some compositional variable (e.g. proportions of habitat types, where the proportions add up to 1) in the model.
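The compositional case is easy to demonstrate with toy data: if the proportions sum to 1, their sum duplicates the intercept column, so the last proportion is aliased.

```r
set.seed(9)
n  <- 60
p1 <- runif(n)
p2 <- runif(n) * (1 - p1)
p3 <- 1 - p1 - p2              # proportions sum to exactly 1
y  <- rpois(n, 2)

m <- glm(y ~ p1 + p2 + p3, family = quasipoisson(link = "log"))
coef(m)  # p3 is NA: p1 + p2 + p3 = 1 duplicates the intercept column
```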

Ben Bolker
  • Thanks Ben. The relationship between `log_a` and `log_gm_a` is that this is a multiplicative fixed-effects model. So `log_a` is the log of predictor `a`. `log_gm_a` is the log of the geometric mean of `a`. So each of the `log_gm` terms is constant across all observations. I'm very curious to hear your response, but the key concept I'm focused on, is that these variables are being dropped and are not, in fact, being used by the model as controls. – Bob Aug 24 '21 at 18:18
  • BTW - my understanding is that you removed support for `quasipoisson` from `lme4` because you were uncertain whether the output was reliable. The author of the model I'm working on insists that the reason the output was unreliable was the failure to include within-transformation terms such as he included in this model. I suspect he's mistaken about that as well... – Bob Aug 24 '21 at 18:29
  • Sounds funky to me. I would probably need to see the whole context before I could draw a conclusion, but I would doubt it. It's been a long time, and IIRC occurred slightly before I became fully involved as a maintainer of the package, but I very much doubt that Doug Bates would have missed something reasonably obvious ... – Ben Bolker Aug 24 '21 at 20:59
  • Without *directly* outing your informant, is there a link to/information about this model framework somewhere? – Ben Bolker Aug 24 '21 at 21:00
  • I'm afraid not, it's commercial rather than academic work. I feel pretty confident you hit the nail on the head. Thank you Ben! – Bob Aug 24 '21 at 21:57
  • Just makes me worry about the validity of the modeling framework ... but, commercial, not my problem :-) – Ben Bolker Aug 24 '21 at 22:30
  • Oh my view is that the modelling framework here has a number of very serious issues, this is just one of them. What I've been trying to do is surface those issues, in part by reconstructing what was originally a stata model in R. In that regard, I would point out that stata *can* fit a mixed-effects model with quasipoisson... hint hint – Bob Aug 24 '21 at 22:42
  • I show how to (pretty easily) effectively fit QP in lme4- or glmmTMB-land [here](https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#fitting-models-with-overdispersion). The downside of open source/free software is that I've never bothered (since I'm not bound by market demand) to wrap this with a "quasi" family designation. (If you wanted to ask a "how do I fit a quasi- model with lme4 or glmmTMB?" question on SO I could showcase that answer - or you could.) – Ben Bolker Aug 24 '21 at 22:46
  • Here you go: https://stackoverflow.com/questions/68915173/how-do-i-fit-a-quasi-poisson-model-with-lme4-or-glmmtmb – Bob Aug 24 '21 at 23:38