I'm using logistic regression to predict a binary outcome variable (Group, 0/1). I've noticed something: I have two variables representing the same outcome. One is coded simply as "0" or "1":

> df$Group
>   [1] 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
>  [59] 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0
> [117] 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1
> [175] 1 0 1
> Levels: 0 1
> is.factor(df$Group)
> [1] TRUE

The other variable represents the same thing, but uses name labels instead:

> df$Group2
>  [1] CON CI  CON CI  CI  CON CI  CI  CON CI  CI  CI  CON CI
> [15] CI  etc.
> Levels: CI CON
> is.factor(df$Group2)
> [1] TRUE
> contrasts(df$Group2)
> CI        0
> CON       1

Here 0 in the first variable corresponds to CON and 1 to CI. I created the numerical variable because I wanted CI to be my "1" group and CON the 0 reference group, but every time I converted the name variable with `as.factor`, CI became level 1 and CON level 2.

I thought the two variables were equivalent, but when I plotted the odds ratios with the sjPlot package and double-checked, I noticed that the ORs were quite different. The coefficients in summary() of the glm model looked the same apart from the sign of the estimates, which makes sense because the two groups are coded in opposite directions. Specifically, the plotted ORs are clearly bigger with the numerical variable and smaller with the "name" variable.

Am I missing something in my understanding of R (I'm self-taught) or in how logistic regression is computed? Which of the two variables should I use in the logistic regression? And how can I make R use CON instead of CI as the 0 reference group in the "name" variable? Thank you.

  • It would be much easier if you provided a full reproducible example with complete code, including the `glm` calls and the different outputs. E.g. have you specified the family? Check `?family` – Roman May 05 '20 at 09:28
  • Hi Roman, thank you for the answer. Yes, I specified the family when fitting the model, i.e. `glm(y ~ x, family = binomial, data = df)`. – WannabeGandalf May 05 '20 at 09:43
  • Here you can find some instructions for reproducible examples: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – desval May 05 '20 at 09:50
  • Does this answer your question? [Logistic regression - defining reference level in R](https://stackoverflow.com/questions/23282048/logistic-regression-defining-reference-level-in-r) – Roman May 05 '20 at 10:20

1 Answer

By default, R orders the levels of a factor alphabetically. You can set your own order with

df$Group2 <- factor(df$Group2, levels=c('CON','CI'))

Then CON would be used as reference level in logistic regression and you should get the same results as with 0/1 coding.
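
To see why the two codings give different odds ratios, here is a minimal sketch on simulated data. The data frame, the predictor `x` and the model names are assumptions made up for illustration, not the asker's data. It relies on the documented behaviour that, for a factor response with `family = binomial`, `glm()` treats the first level as "failure" and the remaining levels as "success". With the alphabetical levels (CI, CON) the model therefore predicts CON rather than CI, the coefficients flip sign, and each odds ratio becomes the reciprocal of the one from the 0/1 coding.

```r
# Hypothetical toy data: df, x and the model names are illustrative assumptions.
set.seed(1)
n   <- 200
x   <- rnorm(n)
grp <- rbinom(n, 1, plogis(0.5 + 0.8 * x))          # 1 = CI, 0 = CON

df <- data.frame(
  x      = x,
  Group  = factor(grp, levels = c(0, 1)),           # levels: 0, 1
  Group2 = factor(ifelse(grp == 1, "CI", "CON"))    # default levels: CI, CON (alphabetical)
)

# First factor level = "failure", so these two fits model opposite events.
m_num   <- glm(Group  ~ x, family = binomial, data = df)  # models P(Group == 1), i.e. CI
m_alpha <- glm(Group2 ~ x, family = binomial, data = df)  # models P(Group2 == "CON")

# Put CON first so that CI becomes the modelled event.
df$Group2 <- factor(df$Group2, levels = c("CON", "CI"))
m_fixed   <- glm(Group2 ~ x, family = binomial, data = df)

or_num   <- exp(coef(m_num))
or_alpha <- exp(coef(m_alpha))
or_fixed <- exp(coef(m_fixed))

# Side-by-side comparison: or_alpha equals 1 / or_num, or_fixed equals or_num.
print(cbind(or_num, or_alpha, reciprocal = 1 / or_num, or_fixed))
```

`relevel(df$Group2, ref = "CON")` is an equivalent way to move CON to the first position. Either coding is valid; the fits describe the same relationship, just for opposite events, so the only thing that matters is knowing which level is being modelled when you read the odds ratios.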

Łukasz Deryło
  • Thank you Lukasz. Could you explain a bit more why the result changes if I simply modify the reference group? Shouldn't the OR be the same regardless of the order of the levels in a variable? – WannabeGandalf May 05 '20 at 09:40
  • This should help you: https://stats.stackexchange.com/questions/401120/interpretation-of-logistic-regression-coefficients-when-there-are-multiple-level – Łukasz Deryło May 05 '20 at 09:54
  • Thank you very much guys, I definitely need to look more thoroughly at the theory behind logistic regression, and you've been of much help. – WannabeGandalf May 05 '20 at 10:46