
I am using the twang package to create propensity scores, which are used as weights in a binomial glm using survey::svyglm. The code looks something like this:

pscore <- ps(ppci ~ var1+var2+.........., data=dt....)

dt$w <- get.weights(pscore, stop.method="es.mean")

design.ps <- svydesign(ids=~1, weights=~w, data=dt)

glm1 <- svyglm(m30 ~ ppci, design=design.ps,family=binomial)

This produces the following warning:

Warning message:
   In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!

Does anyone know what I could be doing wrong?

I wasn't sure if this message would be better on stats.SE, but on balance I thought I would try here first.

Robert Long
  • What type of variable is `m30`? – James Oct 18 '12 at 11:04
  • The weights must be non-integral then. A binomial fit tries to find the probability of success in a discrete number of trials. – James Oct 18 '12 at 11:36
  • @James the weights _are_ non-integral - they are inverse-probabilities (inverse of the propensity scores) - that's what the `twang`+`survey` combination is supposed to be implementing. – Robert Long Oct 18 '12 at 12:01

4 Answers


There's nothing wrong; glm is just picky when it comes to specifying binomial (and Poisson) models. It warns if it detects that the number of trials or successes is non-integral, but it goes ahead and fits the model anyway. If you want to suppress the warning (and you're sure it's not a problem), use family=quasibinomial instead.
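As a minimal sketch (with made-up data, not the OP's setup) that reproduces the warning and shows quasibinomial suppressing it:

set.seed(42)
d <- data.frame(y = rbinom(20, 1, 0.5), x = rnorm(20))
d$w <- runif(20, 0.5, 2)  # non-integer weights, as inverse-probability weights typically are

glm(y ~ x, data = d, weights = w, family = binomial)       # warns: non-integer #successes
glm(y ~ x, data = d, weights = w, family = quasibinomial)  # same coefficients, no warning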

Hong Ooi
  • Indeed, and IIRC all a GLM really needs to know is the stated mean-variance relationship (which is what the `quasi()` families do/allow), the form of the actual data doesn't really matter. The warning is more a bit of nannying I believe. – Gavin Simpson Oct 18 '12 at 12:05
  • Yes, although I've seen a lot of cases where people noticed they were doing something silly because of this warning ... – Ben Bolker Oct 18 '12 at 13:36
  • @BenBolker thanks for your comment. Of course, the reason I posted the question is that I am worried I am doing something silly. – Robert Long Oct 18 '12 at 18:57
  • Just to also note here: models fitted with a `quasibinomial` family have `aic` as `NA`, i.e. `model$aic` returns NA. – joel.wilson May 31 '17 at 16:00
  • The quasibinomial also has an extra parameter compared to the binomial. It should give different results. – pdb Jul 13 '17 at 17:51
  • @PaulBailey the estimated _regression coefficients_ are identical. Their _standard errors_ are different. – Hong Ooi Jul 13 '17 at 18:23
  • You need to specify scale = 1 if you use the quasibinomial instead of the binomial. Otherwise, you won't get the same results. Also, you can't get the log-likelihood or AIC with the quasibinomial family. – Regis Mar 30 '21 at 10:58

Like @HongOoi said, glm.fit with the binomial family expects integer counts and throws a warning otherwise; if you want non-integer counts, use quasi-binomial. The rest of my answer compares these two.

Quasi-binomial in R's glm.fit gives exactly the same coefficient estimates as binomial (as mentioned in the comments by @HongOoi) but not the same standard errors (as mentioned in the comment by @nograpes).

Comparison of source code

A diff on the source code of stats::binomial and stats::quasibinomial shows the following changes:

  • the text "binomial" becomes "quasibinomial"
  • the aic function returns NA instead of the calculated AIC

and the following removals:

  • setting outcomes to 0 when weights = 0
  • check on integrality of weights
  • simfun function to simulate data

Only simfun could make a difference, but the source code of glm.fit shows no use of that function, unlike other fields in the object returned by stats::binomial such as mu.eta and link.
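You can see this yourself by comparing the two family objects directly (a quick sketch; the exact set of components may vary across R versions):

b <- stats::binomial()
q <- stats::quasibinomial()
setdiff(names(b), names(q))  # "simulate" (built from simfun) exists only in binomial
q$aic                        # a function that always returns NA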

Minimal working example

The results from using quasibinomial or binomial are the same for the coefficients in this minimal working example:

library('MASS')
library('stats')

gen_data <- function(n=100, p=3) {
  # non-integer observation weights, a binary outcome,
  # and predictors whose means depend on the outcome
  set.seed(1)
  weights <- stats::rgamma(n=n, shape=rep(1, n), rate=rep(1, n))
  y <- stats::rbinom(n=n, size=1, prob=0.5)
  theta <- stats::rnorm(n=p, mean=0, sd=1)
  means <- colMeans(as.matrix(y) %*% theta)
  x <- MASS::mvrnorm(n=n, means, diag(1, p, p))

  return(list(x=x, y=y, weights=weights, theta=theta))
}

fit_glm <- function(family) {
  # fit via glm.fit directly, passing the non-integer weights
  data <- gen_data()
  fit <- stats::glm.fit(x = data$x,
                        y = data$y,
                        weights = data$weights,
                        family = family)
  return(fit)
}

fit1 <- fit_glm(family=stats::binomial(link = "logit"))
fit2 <- fit_glm(family=stats::quasibinomial(link = "logit"))

all(fit1$coefficients == fit2$coefficients)  # TRUE: the estimates agree exactly

Comparison with the quasibinomial probability distribution

This thread suggests that the quasibinomial distribution is different from the binomial distribution with an additional parameter phi. But they mean different things in statistics and in R.

First, nothing in the source code of stats::quasibinomial mentions an additional phi parameter.

Second, a quasiprobability is similar to a probability, but not a proper one. In this case, one cannot compute the binomial coefficient \binom{n}{k} when the numbers are non-integers, although one could with the Gamma function. This may be a problem for the definition of the probability distribution, but it is irrelevant for estimation, since the term \binom{n}{k} does not depend on the parameter and falls out of the optimisation.

The log-likelihood estimator is:

\hat{\theta} = \arg\max_{\theta} \ell(\theta)

The log-likelihood as a function of theta with the binomial family is:

\ell(\theta) = \sum_i \left[ \log \binom{n_i}{k_i} + k_i \log p_i(\theta) + (n_i - k_i) \log(1 - p_i(\theta)) \right] = \text{constant} + \sum_i \left[ k_i \log p_i(\theta) + (n_i - k_i) \log(1 - p_i(\theta)) \right]

where the constant is independent of the parameter theta, so it falls out of optimisation.
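A quick way to check that nothing extra enters the fitting machinery itself (a sketch; verified on a recent R version, where both families define the same variance function):

b <- stats::binomial()
q <- stats::quasibinomial()
identical(body(b$variance), body(q$variance))  # TRUE: both are mu * (1 - mu)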

Comparison of standard errors

The standard errors calculated by stats::summary.glm use different dispersion values for the binomial and quasibinomial families, as documented in the help for stats::summary.glm:

The dispersion of a GLM is not used in the fitting process, but it is needed to find standard errors. If dispersion is not supplied or NULL, the dispersion is taken as 1 for the binomial and Poisson families, and otherwise estimated by the residual Chisquared statistic (calculated from cases with non-zero weights) divided by the residual degrees of freedom.

...

cov.unscaled: the unscaled (dispersion = 1) estimated covariance matrix of the estimated coefficients.

cov.scaled: ditto, scaled by dispersion.

Using the above minimal working example:

summary1 <- stats::summary.glm(fit1)
summary2 <- stats::summary.glm(fit2)

print("Equality of unscaled variance-covariance-matrix:")
all(summary1$cov.unscaled == summary2$cov.unscaled)

print("Equality of variance-covariance matrix scaled by `dispersion`:")
all(summary1$cov.scaled == summary2$cov.scaled)

print(summary1$coefficients)
print(summary2$coefficients)

shows the same coefficients, same unscaled variance-covariance matrix, and different scaled variance-covariance matrices:

[1] "Equality of unscaled variance-covariance-matrix:"
[1] TRUE
[1] "Equality of variance-covariance matrix scaled by `dispersion`:"
[1] FALSE
       Estimate Std. Error   z value   Pr(>|z|)
[1,] -0.3726848  0.1959110 -1.902317 0.05712978
[2,]  0.5887384  0.2721666  2.163155 0.03052930
[3,]  0.3161643  0.2352180  1.344133 0.17890528
       Estimate Std. Error   t value   Pr(>|t|)
[1,] -0.3726848  0.1886017 -1.976042 0.05099072
[2,]  0.5887384  0.2620122  2.246988 0.02690735
[3,]  0.3161643  0.2264421  1.396226 0.16583365
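As a cross-check (a sketch reusing fit1, fit2, summary1 and summary2 from above), the quasibinomial dispersion can be reproduced by hand from the working residuals and weights, just as summary.glm computes it, and the quasibinomial standard errors are the binomial ones scaled by its square root:

# Pearson chi-squared statistic divided by the residual degrees of freedom
phi <- sum((fit2$weights * fit2$residuals^2)[fit2$weights > 0]) / fit2$df.residual

all.equal(summary2$dispersion, phi)
all.equal(summary2$coefficients[, "Std. Error"],
          summary1$coefficients[, "Std. Error"] * sqrt(phi))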
miguelmorin
  • The standard errors of the coefficients aren't calculated the same way for the quasibinomial and binomial families. You can see the difference if you look at the `stats::summary.glm` function. For the binomial family (and Poisson), the dispersion is hardcoded to 1. For the quasibinomial family, the dispersion is calculated in the "usual" way. This leads to a different scaled covariance matrix, and hence different standard errors. – nograpes Apr 04 '19 at 14:44

There is nothing wrong computationally, but statistically you may be doing something that doesn't make much sense. In such a case, it is probably better to use a robust regression method, which is generally a good idea for proportional response data if your data include units with responses of exactly 1 or exactly 0.
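For instance, one common route to robust (sandwich) standard errors for a binomial fit, as a sketch with made-up data (assumes the sandwich and lmtest packages are installed):

set.seed(2)
d <- data.frame(x = rnorm(50))
d$y <- rbinom(50, 1, plogis(d$x))

fit <- glm(y ~ x, family = binomial, data = d)
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC0"))  # heteroscedasticity-consistent SEs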

HaberdashPI
  • "... also uses a different method to fit the data" -- this is wrong. The quasibinomial and binomial families use _exactly_ the same numerical method, i.e. IRLS with appropriately chosen mu and eta. The differences are that quasibinomial 1) suppresses the integer check, and 2) doesn't report an AIC, since it's technically not maximum likelihood-based. – Hong Ooi Oct 29 '16 at 07:57
  • You can check for yourself that quasibinomial is no more robust than binomial, just by generating random data and fitting models with the 2 families. You should find that, regardless of what the data is like or how close you are to linearly separable classes, the model estimates are exactly the same. – Hong Ooi Oct 29 '16 at 08:00
  • Thanks for the edification Hong Ooi! It appears I was misinformed by another answer on a similar topic on StackExchange's Cross Validated. That is very good to know! It doesn't seem to me that using quasibinomial is a very good approach in this case, then. – HaberdashPI Dec 05 '16 at 01:41

Sorry, but it is more robust in the sense that if the underlying mechanism is an overdispersed binomial model, the overdispersed binomial will account for it when estimating the standard error. Hence, you will get better coverage, even though the point estimates are the same.
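A sketch illustrating this with simulated beta-binomial data (genuinely overdispersed; all names here are illustrative):

set.seed(1)
n <- 200; size <- 20
x <- rnorm(n)
# beta-binomial outcomes are overdispersed relative to the binomial
p <- plogis(-0.5 + x)
y <- rbinom(n, size = size, prob = rbeta(n, 5 * p, 5 * (1 - p)))

fit_b <- glm(cbind(y, size - y) ~ x, family = binomial)
fit_q <- glm(cbind(y, size - y) ~ x, family = quasibinomial)

summary(fit_b)$dispersion  # fixed at 1 by definition
summary(fit_q)$dispersion  # estimated well above 1, hence wider standard errors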

user2809432
  • In the `glm.fit` function, I see no way that the standard errors could be different (see my answer). Maybe `survey` uses the additional `simfun` function in the binomial family. – miguelmorin Mar 21 '19 at 16:03