
I am confused by the answer to "Logistic regression - defining reference level in R".

It said that if you want to predict the probability of "Yes", you set relevel(auth$class, ref = "YES"). However, in my experiment with a binary response variable coded "0" and "1", we only get the estimated probability of "1" when we set relevel(factor(y), ref = "0").

n <- 200
x <- rnorm(n)
sumx <- 5 + 3 * x
exp1 <- exp(sumx) / (1 + exp(sumx))
y <- rbinom(n, 1, exp1)  # exp1 is the probability that y is 1

model1 <- glm(y ~ x, family = "binomial")
summary(model1)$coefficients
            Estimate Std. Error  z value     Pr(>|z|)
(Intercept) 5.324099  1.0610921 5.017565 5.233039e-07
x           2.767035  0.7206103 3.839849 1.231100e-04

model2 <- glm(relevel(factor(y), ref = "0") ~ x, family = "binomial")
summary(model2)$coefficients
            Estimate Std. Error  z value     Pr(>|z|)
(Intercept) 5.324099  1.0610921 5.017565 5.233039e-07
x           2.767035  0.7206103 3.839849 1.231100e-04

So what is my mistake? And what does glm() predict by default if the response is coded as something other than "0" and "1"?

David Lee
  • If `P(0)` is the probability of 0, `P(1)` is the probability of 1, then `P(0) = 1 - P(1)`. Thus, you can always calculate the probability of the reference level, regardless of which level you set as the reference. – eipi10 Apr 15 '16 at 02:57
  • @eipi10, yes, I understand that. I just want to know whether I have misunderstood the original answer. I think that if we want to get the probability of "Yes", we should set relevel(auth$class, ref = "No"); am I correct? And what does "reference level" mean here? – David Lee Apr 15 '16 at 03:02
  • That's correct. But you could also set the reference level to "Yes" and you could still get the probability of "Yes" by doing 1 - P(No). The reference level just means that the model is predicting the probability of the other (non-reference) level. – eipi10 Apr 15 '16 at 03:10

1 Answer


If P(0) is the probability of 0 and P(1) is the probability of 1, then P(0) = 1 - P(1). Thus, you can always calculate the probability of the reference level, regardless of which level you set as the reference.

For example, predict(model1, type="response") gives you the probability of the non-reference level. 1 - predict(model1, type="response") gives you the probability of the reference level.
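As a rough illustration of this, here is a minimal sketch on simulated data. The seed, the coefficients 1 and 2, and the names fit, fit_flipped, p1 and p0 are all assumptions made for the example, not taken from the question. It fits the model once with y as the outcome and once with the outcome flipped, and checks that the two sets of predictions are complements of each other.

```r
set.seed(1)                             # arbitrary seed, only for reproducibility
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + 2 * x))    # hypothetical true coefficients

fit <- glm(y ~ x, family = binomial)
p1 <- predict(fit, type = "response")   # estimated P(y = 1), the non-reference level
p0 <- 1 - p1                            # estimated P(y = 0), the reference level

# Flipping the outcome makes the model estimate P(y = 0) directly.
y_flipped <- 1 - y
fit_flipped <- glm(y_flipped ~ x, family = binomial)

# The coefficients change sign and the fitted probabilities are complements.
all.equal(unname(coef(fit_flipped)), -unname(coef(fit)), tolerance = 1e-6)
all.equal(unname(predict(fit_flipped, type = "response")), unname(p0),
          tolerance = 1e-6)
```

Both all.equal() calls should return TRUE, which is another way of saying that the choice of reference level changes the sign of the coefficients but not the information in the model.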

You also asked what glm() predicts by default when the response is coded as something other than "0" and "1". For (binomial) logistic regression to be appropriate, your outcome needs to be a categorical variable with two categories. You can call them whatever you want: 0/1, black/white, because/otherwise, Mal/Serenity, etc. One will be the reference level (whichever you prefer) and the model will give you the probability of the other level. For a factor response, glm() treats the first level as the reference, so by default it predicts the probability of the second level; relevel() only changes which level comes first. The probability of the reference level is just 1 minus the probability of the other level.
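To make the role of the reference level concrete, here is a small sketch with a "No"/"Yes" factor outcome. The data are simulated, and the names cls, cls_yes_ref, fit_no_ref and fit_yes_ref are made up for this example. It relies on the documented behaviour of the binomial family that, for a factor response, the first level is treated as failure and the other as success.

```r
set.seed(2)                              # arbitrary seed, only for reproducibility
n <- 200
x <- rnorm(n)
cls <- factor(ifelse(rbinom(n, 1, plogis(1 + 2 * x)) == 1, "Yes", "No"))

levels(cls)                              # alphabetical order puts "No" first

# Reference level "No": the model estimates P(cls == "Yes").
fit_no_ref <- glm(cls ~ x, family = binomial)
p_yes <- predict(fit_no_ref, type = "response")

# Reference level "Yes": the model now estimates P(cls == "No").
cls_yes_ref <- relevel(cls, ref = "Yes")
fit_yes_ref <- glm(cls_yes_ref ~ x, family = binomial)
p_no <- predict(fit_yes_ref, type = "response")

# The two fits describe the same model from opposite sides.
range(p_yes + p_no)                      # both ends should be 1, up to numerical error
cbind(no_ref = coef(fit_no_ref), yes_ref = coef(fit_yes_ref))
```

This is also why the two fits in the question are identical: for a 0/1 response, "0" is already the first level, so relevel(factor(y), ref = "0") does not change anything.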

If your outcome has more than two categories, you can use a multinomial logistic regression model, but the principle is similar.

eipi10