Is there any way to fit a `glm()` so that all levels are included (i.e. no reference level)?

Question

Consider the code:

x <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",
                header=TRUE)[,1:2]

fit <- glm(education ~ age, family="binomial", data=x)

summary(fit)

Where age has 4 levels: "<25" "25-29" "30-39" "40-49"

The results are:

enter image description here

So by default, one of the levels is used as a reference level. Is there a way to have glm output coefficients for all 4 levels + the intercept (i.e. have no reference level)? Software packages like SAS do this by default, so I was wondering if there was any option for this.

Thanks!

My understanding is that the coefficient for the reference level is just 0. — esa606, Apr 28 '14 at 17:33
You are looking for something called `effect coding`. See [here](http://stats.stackexchange.com/questions/52132/how-to-do-regression-with-effect-coding-instead-of-dummy-coding-in-r) — coffeinjunky, Apr 28 '14 at 17:33
does SAS really output intercept *and* all four levels? It doesn't really make sense to *parameterize* a model with N+1 parameters when there are only N levels, but you can output *predictions* of the model for all four levels with e.g. the `lsmeans` package. It might be helpful to show the SAS output for the corresponding model, illustrating what you want. — Ben Bolker, Apr 28 '14 at 17:54

score 3 · Answer 1 · answered Apr 28 '14 at 17:38

3

See ?formula, specifically, the meaning of including + 0 in your model specification...

# Sample data - explanatory variable (continuous)
x <- runif( 100 )
# explanatory data, factor with 3 levels
f <- as.factor( sample( 3 , 100 , TRUE ) )
# outcome data
y <- runif( 100 ) + rnorm(100) + rnorm( 100 , mean = c(1,3,6) )

# model without intercept
summary( glm( y ~ x + f + 0 ) )
#Call:
#glm(formula = y ~ x + f + 0)

#Deviance Residuals: 
#    Min       1Q   Median       3Q      Max  
#-5.7316  -1.8923   0.0195   1.8918   5.9520  

#Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
#x    0.3216     0.9772   0.329    0.743    
#f1   3.4493     0.6823   5.055 2.06e-06 ***
#f2   3.6349     0.6959   5.223 1.02e-06 ***
#f3   3.1962     0.6598   4.844 4.87e-06 ***

answered Apr 28 '14 at 17:38

Simon O'Hanlon

58,647
14
142
184

So what would the intercept be if you include all the levels? Look at the parameter differences when running `+0` and `+1` (i.e. with intercept)... – Simon O'Hanlon Apr 28 '14 at 17:48
I don't think I've seen "+0" used for this, I think the usual method is "-1". – Spacedman Apr 28 '14 at 17:53
@Spacedman Again, I refer you to `?formula`! *A model with no intercept can be also specified as `y ~ x + 0` or `y ~ 0 + x`.* :-) – Simon O'Hanlon Apr 28 '14 at 17:55
Yes, I wasn't questioning its correctness, just its frequency! FWIW `x-0` is different to `x+0` which disagrees with the mathematician in me, and any other constant generates an error! – Spacedman Apr 28 '14 at 17:58
@Spacedman ok, ok! I prefer `+ 0` myself, because I think of it as `+ 0*b0` i.e. the model must run through the origin so account for all other parameters if that makes sense? – Simon O'Hanlon Apr 28 '14 at 18:00

score 1 · Answer 2 · edited May 23 '17 at 12:31

1

You'll want to use the model.matrix function to convert the factors in the age variable to binary variables.

See this answer.

EDIT: Here is an example:

x <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",
                header=TRUE)[,1:2]
binary_variables <- model.matrix(~ x$age -1, x)
fit <- glm(x$education ~ binary_variables, family="binomial")
summary(fit)

edited May 23 '17 at 12:31

Community

1
1

answered Apr 28 '14 at 17:35

theWanderer4865

861
13
20

3

`glm` does this anyway when using factor variables. I don't see how this helps. Just include `-1` or `+0` in the model spec and this will be taken care of automagically. – Simon O'Hanlon Apr 28 '14 at 17:49
You are absolutely right, I had no idea glm did that. Please forgive my ignorance. – theWanderer4865 Apr 28 '14 at 17:59
it can still be useful to know how to do this in some contexts (e.g. sometimes you want to drop derived (input) variables from the model in a way that is (AFAIK) impossible to specify via a formula ... – Ben Bolker Apr 28 '14 at 18:03
@theWanderer4865 Sorry that wasn't meant to come across as rude! I just think you should be aware it's unnecessary for convenience, but Ben is absolutely right and sometimes it is good to know how the model is working 'under the hood' as it were so you understand what it *means*. So in that context it *is* useful. – Simon O'Hanlon Apr 28 '14 at 18:05
@SimonO'Hanlon, I think you are wrong, becasue your suggestion doesn't seem to include the intercept (which the OP wants), while this solution includes the intercept too – David Arenburg Apr 28 '14 at 18:07
@DavidArenburg The intercept in that case is `0` to the best approximation that R can muster (5e-16!!!) due to bad dataset (which is mentioned in the summary). The effect of `-1` or `+0` is to force the line through the origin. Run `model.matrix` with and without `-1` and look at `binary_variables`. – Simon O'Hanlon Apr 28 '14 at 18:15
@SimonO'Hanlon, you are right, I didn't notice that it's 0, but would it be always 0, i.e., for a better data set? – David Arenburg Apr 28 '14 at 18:19
AFAIK, due to the effect of what `-1` is actually doing it should *better* converge on 0 with more data as that is in effect what it is *supposed* to be doing. @BenBolker knows for sure though. – Simon O'Hanlon Apr 28 '14 at 18:21
You are, right, I didn't know that `-1` syntax takes out the intercept. Actually I wasn't attacking you, but aksing questions out of interest. I never used the `model.matrix` function before so just was qurious – David Arenburg Apr 28 '14 at 18:30
It surprises me a little that `lm` doesn't detect the overspecified model in this case. The answer's actually arbitrary; it's like fitting a model `Y=a0+a1+b*x` -- any linear combination of `a0` and `a1` that add up to the true intercept will be equally good fits. – Ben Bolker Apr 28 '14 at 21:43
I'm now a little sorry I upvoted (since now that I look more closely I see that the model is overfitted): I would prefer that you use `+0` or `-1` in your `lm` call to suppress the intercept, since otherwise the results are very likely to be misleading. I don't seem to be able to retract my upvote without downvoting ... – Ben Bolker Apr 28 '14 at 21:44

Is there any way to fit a `glm()` so that all levels are included (i.e. no reference level)?

2 Answers2

Linked