I want to use a logistic regression to actually perform regression and not classification.

My response variable is numeric between 0 and 1 and not categorical. This response variable is not related to any kind of binomial process. In particular, there is no "success", no "number of trials", etc. It is simply a real variable taking values between 0 and 1 depending on circumstances.

Here is a minimal example to illustrate what I want to achieve:

dummy_data <- data.frame(a=1:10, 
                         b=factor(letters[1:10]), 
                         resp = runif(10))
fit <- glm(formula = resp ~ a + b, 
           family = "binomial",
           data = dummy_data)

This code gives a warning then fails because I am trying to fit the "wrong kind" of data:

In eval(family$initialize) : non-integer #successes in a binomial glm!

Yet I think there must be a way since the help of family says:

For the binomial and quasibinomial families the response can be specified in one of three ways: [...] (2) As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights).

Somehow, the same code works using "quasibinomial" as the family, which makes me think there may be a way to make it work with a binomial glm.

I understand the likelihood is derived under the assumption that $y_i \in \{0, 1\}$ but, looking at the maths, it seems like the log-likelihood still makes sense with $y_i \in [0, 1]$. Am I wrong?
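For concreteness, the log-likelihood referred to here can be written as below. This is a sketch assuming the standard logit link; the symbols $p_i$, $x_i$ and $\beta$ are notation introduced for illustration and do not appear in the code above.

$$\ell(\beta) = \sum_{i} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right], \qquad p_i = \frac{1}{1 + e^{-x_i^\top \beta}}$$

Each term is finite for any $y_i \in [0, 1]$ as long as $0 < p_i < 1$, and the resulting score equations $\sum_i (y_i - p_i)\, x_i = 0$ depend on $y_i$ only through its difference from the fitted mean.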

asachet

2 Answers

This is because you are using the binomial family with the wrong kind of response. With the binomial family and no weights, the outcome has to be either 0 or 1, not a probability value.

This code works fine, because the response is either 0 or 1.

dummy_data <- data.frame(a=1:10, 
                         b=factor(letters[1:10]), 
                         resp = sample(c(0,1), 10, replace=TRUE, prob=c(.5,.5)))

fit <- glm(formula = resp ~ a + b, 
           family = binomial(),
           data = dummy_data)

If you want to model the probability directly, you should include an additional column with the total number of cases. The probability you want to model is then interpreted as the success rate, given the number of cases in the weights column.

dummy_data <- data.frame(a=1:10, 
                         b=factor(letters[1:10]), 
                         resp = runif(10), w = round(runif(10,1,11)))

fit <- glm(formula = resp ~ a + b, 
           family = binomial(),
           data = dummy_data, weights = w)

You will still get the warning message, but you can ignore it, given these conditions:

  1. resp is the proportion of 1's in n trials.

  2. for each value in resp, the corresponding value in w is the number of trials.
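When both conditions hold, the proportion-plus-weights form should give the same fit as the two-column successes/failures form. Here is a minimal sketch with simulated data; the names `d`, `n`, `s`, `prop`, `fit_prop` and `fit_counts` are made up for illustration and are not from the question.

```r
set.seed(42)

# Hypothetical data: n trials per row, s successes drawn from a binomial
d <- data.frame(x = 1:10,
                n = c(5, 8, 10, 6, 9, 7, 12, 4, 11, 10))
d$s    <- rbinom(nrow(d), size = d$n, prob = plogis(-1 + 0.2 * d$x))
d$prop <- d$s / d$n   # proportion of 1's in n trials

# Form 1: proportion as response, number of trials as weights
fit_prop <- glm(prop ~ x, family = binomial(), data = d, weights = n)

# Form 2: two-column matrix of successes and failures
fit_counts <- glm(cbind(s, n - s) ~ x, family = binomial(), data = d)

# The coefficient estimates agree
all.equal(coef(fit_prop), coef(fit_counts))
```

Because `prop * n` recovers integer counts here, this sketch should not trigger the non-integer warning. That warning appears when the response times the weights is not (close to) a whole number.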

Marco De Virgilis
  • I know why it is happening, I want to know if I can work around it. After all, the raw output of the glm is in (0, 1) so why not fit the glm to values between 0 and 1? – asachet Oct 23 '18 at 08:21
  • After your edit: I agree that if I had data that was actually binomial, I would not have any problem -- but that's not really helpful :). My data is not drawn from a binomial distribution, I simply have a variable which varies between 0 and 1 depending on circumstances (in particular, there is no meaningful "number of trials"). I will make it clearer in the question. – asachet Oct 23 '18 at 08:30

Based on the discussion at "Warning: non-integer #successes in a binomial glm! (survey packages)", I think this can be solved by using another family function, quasibinomial() (see ?quasibinomial).

dummy_data <- data.frame(a=1:10, 
                         b=factor(letters[1:10]), 
                         resp = runif(10),w=round(runif(10,1,11)))

fit2 <- glm(formula = resp ~ a + b, 
           family = quasibinomial(),
           data = dummy_data, weights = w)
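For a response that is simply a number in (0, 1) with no meaningful number of trials, the same idea can be tried without weights. The sketch below uses simulated data; `d`, `x`, `fit_q` and `fit_b` are illustrative names, and the data-generating formula is an assumption made only to get values strictly between 0 and 1.

```r
set.seed(1)

# Hypothetical continuous response strictly between 0 and 1
d <- data.frame(x = seq(-2, 2, length.out = 50))
d$resp <- plogis(0.5 + 1.2 * d$x + rnorm(50, sd = 0.5))

# quasibinomial: same mean model and link, no non-integer warning
fit_q <- glm(resp ~ x, family = quasibinomial(), data = d)

# binomial: warns about non-integer successes, so silence it for comparison
fit_b <- suppressWarnings(glm(resp ~ x, family = binomial(), data = d))

# Point estimates coincide; only the dispersion (and so the SEs) differs
all.equal(coef(fit_q), coef(fit_b))
summary(fit_q)$dispersion   # estimated from the data
summary(fit_b)$dispersion   # fixed at 1
```

The two families share the same estimating equations for the coefficients. The difference is that quasibinomial estimates a dispersion parameter instead of fixing it at 1, so standard errors and tests reflect the observed variability rather than a binomial variance assumption.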

Shixiang Wang