I'm running a binary logistic regression in R, and some of the independent variables are ordinal. I want to make sure I'm doing it correctly. In the example below, I created sample data and ran glm() treating the independent variable "I" as continuous. Then I ran it again using ordered(I) instead. The results differ somewhat, so the test seems to have worked. My question is whether it's doing what I think it's doing: i.e., taking the integer data, coercing it to an ordered factor based on the integer values, and fitting glm() with a different model that allows the distances between "1", "2", "3", etc. to be unequal, which would make it "correct" for ordinal data. Is that right?

> str(gorilla)
'data.frame':   14 obs. of  2 variables:
 $ I: int  1 1 1 2 2 2 3 3 4 4 ...
 $ D: int  0 0 1 0 0 1 1 1 0 1 ...
> glm.out = glm(D ~ I, family=binomial(logit), data=gorilla)
> summary(glm.out)

...then tried it again with ordered():

> glm.out = glm(D ~ ordered(I), family=binomial(logit), data=gorilla)
> summary(glm.out)

PS: In case it helps, here's the full output from both fits (one thing I notice is the very large standard errors in the second):

Call:
glm(formula = D ~ I, family = binomial(logit), data = gorilla)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7067  -1.0651   0.7285   1.0137   1.4458  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.0624     1.2598  -0.843    0.399
I             0.4507     0.3846   1.172    0.241

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19.121  on 13  degrees of freedom
Residual deviance: 17.621  on 12  degrees of freedom
AIC: 21.621

Number of Fisher Scoring iterations: 4

> glm.out = glm(D ~ ordered(I), family=binomial(logit), data=gorilla)
> summary(glm.out)

Call:
glm(formula = D ~ ordered(I), family = binomial(logit), data = gorilla)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.66511  -0.90052   0.00013   0.75853   1.48230  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)     3.6557   922.4405   0.004    0.997
ordered(I).L    1.3524     1.2179   1.110    0.267
ordered(I).Q   -9.5220  2465.3259  -0.004    0.997
ordered(I).C    0.1282     1.2974   0.099    0.921
ordered(I)^4   13.6943  3307.5816   0.004    0.997

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19.121  on 13  degrees of freedom
Residual deviance: 14.909  on  9  degrees of freedom
AIC: 24.909

Number of Fisher Scoring iterations: 17
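
One rough way to put the two fits side by side is a likelihood-ratio comparison, since a straight line in I is a special case of a separate effect per level. The following is only a sketch under that assumption: it rebuilds the data from the "Data used" listing in this post, and the object names `fit_num` and `fit_ord` are made up for illustration.

```r
# Sample data, copied from the "Data used" listing in this post.
gorilla <- data.frame(
  I = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5),
  D = c(0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
)

# Fit 1: I treated as numeric, i.e. a single slope on the logit scale.
fit_num <- glm(D ~ I, family = binomial(logit), data = gorilla)

# Fit 2: I treated as an ordered factor (4 polynomial contrast terms).
# glm() may warn about fitted probabilities of 0 or 1 here.
fit_ord <- glm(D ~ ordered(I), family = binomial(logit), data = gorilla)

# Residual deviances and residual degrees of freedom of both fits.
c(numeric = deviance(fit_num), ordered = deviance(fit_ord))
c(numeric = df.residual(fit_num), ordered = df.residual(fit_ord))

# Likelihood-ratio (chi-square) test of the nested models.
anova(fit_num, fit_ord, test = "Chisq")
```

With only 14 observations the test has very little power, so it is best read as an illustration of the mechanics rather than as evidence for either coding.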

Data used:

I,D
1,0
1,0
1,1
2,0
2,0
2,1
3,1
3,1
4,0
4,1
5,0
5,1
5,1
5,1    
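
For anyone who wants to reproduce this, the listing above can be turned into a data frame directly. This is a minimal sketch; the only assumption is that both columns are stored as integers, as in the str() output earlier in the post.

```r
# Rebuild the 14-row sample data shown above (columns I and D).
gorilla <- data.frame(
  I = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 5L),
  D = c(0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L)
)
str(gorilla)

# Cross-tabulate the predictor levels against the outcome.
table(I = gorilla$I, D = gorilla$D)
```

The cross-tabulation shows that both observations with I = 3 have D = 1. That is a plausible source of the very large standard errors and the 17 Fisher scoring iterations in the ordered(I) fit: a level in which the outcome never varies pushes its fitted probability toward 1.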
Nickadoo
  • Please consider including a *small* [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so we can better understand and more easily answer your question. In particular, it would be useful to see the summary output of both models (`D ~ I` and `D ~ ordered(I)`, and maybe `D ~ factor(I)` for comparison). Short answer: you are essentially doing this right, but you might need some help interpreting the results. – Ben Bolker Jun 26 '14 at 20:18
  • Okay, thanks...I've added the outputs and the small data set that I was using. I also noticed the big standard error numbers when using ordered(). – Nickadoo Jun 26 '14 at 21:54
  • This example is very helpful. Notice how the coefficient names change when you use `ordered()`. This is because R uses a different contrast by default for ordered factors: [Orthogonal Polynomial Coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#ORTHOGONAL), hence the unusual suffixes. You can expect much larger standard errors when treating the variable as categorical, because each estimate is then based on far fewer observations per group. – MrFlick Jun 26 '14 at 22:01
  • This question appears to be off-topic because it is about a request for statistical tutoring. – IRTFM Jun 27 '14 at 01:02
  • Heh, I retracted my close vote, thinking I would vote to close as a duplicate, but have apparently lost my opportunity. This seems essentially the same as: http://stackoverflow.com/questions/14923684/interpreting-the-output-of-glm-for-poisson-regression – IRTFM Jun 27 '14 at 01:08
  • @MrFlick - That's very helpful, thanks. But now I'm a little confused because the link to Orthogonal Polynomial Coding says "used only with an ordinal variable in which the levels are equally spaced." If the levels can be assumed to be equally spaced, why would it need to be considered ordinal? I thought the whole definition of ordinal is that one [can make no assumption about it being evenly spaced](http://en.wikipedia.org/wiki/Ordinal_data). – Nickadoo Jun 27 '14 at 06:49
  • @Nickadoo well, what is the exact regression model you want to run? Choosing appropriate statistical models is off topic for this site, but if you know the exact model you want to run, it's easier to help you find the R code to do it. – MrFlick Jun 27 '14 at 06:52
  • Thanks for the insights. I understand that this forum is only about how to accomplish the task in the software, which was my intent. Only in light of the responses do interpretive issues arise. I'm working with someone using SPSS, where one selects in the GUI if an independent variable is ordinal, continuous, etc., and I'm trying to do the "same thing" in R. I'm dealing with independent variables such as age where the groups have different numbers of years, or education (certificate, bachelor...). Now suppose I decided to use Helmert coding. How would I force R to do that? – Nickadoo Jun 27 '14 at 16:30
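
The contrast questions raised in the comments can be explored directly in R. The sketch below is illustrative rather than a recommendation of any particular coding: the column `If`, the object names `fit_poly` and `fit_helmert`, and the unequal `scores` values are all made up for the example. It uses `contr.poly()` (the default for ordered factors), its `scores` argument for unequally spaced levels, and the `contrasts` argument of `glm()` to request Helmert coding.

```r
# Sample data from the question.
gorilla <- data.frame(
  I = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5),
  D = c(0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
)

# Default contrast matrix for a 5-level ordered factor. Its column names
# (.L, .Q, .C, ^4) are the suffixes seen in the summary() output.
contr.poly(5)

# By default contr.poly() assumes equally spaced levels (scores = 1:n).
# Unequal spacing can be supplied; these scores are purely hypothetical.
contr.poly(5, scores = c(1, 2, 3, 5, 8))

# Helmert coding: use a factor and name the contrast function in glm().
gorilla$If <- factor(gorilla$I)
fit_helmert <- glm(D ~ If, family = binomial(logit), data = gorilla,
                   contrasts = list(If = "contr.helmert"))

# Polynomial coding, as in the question.
fit_poly <- glm(D ~ ordered(I), family = binomial(logit), data = gorilla)

# The contrast matrix actually used for the Helmert fit.
contr.helmert(5)

# Both codings span the same five group effects, so the overall fit
# (deviance) should agree; only the individual coefficients differ.
c(helmert = deviance(fit_helmert), poly = deviance(fit_poly))
```

Packages differ in what they call "Helmert" coding, so when matching results against SPSS it is worth comparing the printed `contr.helmert(5)` matrix with the contrast definition used there instead of relying on the name.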

0 Answers