
I am trying to do a regression in R using the lm and the glm function.

My dependent variable is logit-transformed data based on the proportion of events over non-events within a given time period. So my dependent variable is continuous, whereas my independent variables are factor variables (dummies).

I have two independent variables that can take the values of

  • Year i to year m, my YEAR variable
  • Month j to month n, my MONTH variable

The problem is that whenever I run my model and look at the summary, the results for April (index 1 for month) and 1998 (index 1 for year) are not among the coefficients... if I rename April to, say, "foo_bar", August goes missing instead...

Please help! This is frustrating me and I simply do not know how to search for a solution to the problem.

Kasper Christensen
    That's to be expected. What's the actual problem you're having? – NPE Mar 05 '13 at 19:02
  • Hmm, how come? The actual problem is that I just want to assess the influence of my time variables on my events, so that maybe December will have a significant impact on my events. Maybe 20 out of 100 bought something in a toy store during December, and in the rest of the months only 5 out of 100 bought something... – Kasper Christensen Mar 05 '13 at 19:05
  • This is not an R question and does not belong here. – Ista Mar 05 '13 at 19:15
  • 2
    Related: http://stackoverflow.com/questions/7337761/linear-regression-na-estimate-just-for-last-coefficient/7341074#7341074 – Ben Bolker Mar 05 '13 at 20:45
  • Ista: Why don't you see this as an R question? – Kasper Christensen Mar 09 '13 at 17:17

1 Answer


If R were to create a dummy variable for every level in the factor, the resulting set of variables would be linearly dependent (assuming there is also an intercept term). Therefore, one factor level is chosen as the baseline and has no dummy generated for it.
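You can see the dummy coding R generates with `model.matrix` (a minimal sketch using a made-up three-level factor `f`); note that the three level indicators sum to the intercept column, which is exactly the linear dependence described above:

```r
f <- factor(c('a', 'a', 'b', 'b', 'c', 'c'))

# Default treatment coding: the first level ('a') is the baseline
# and gets no dummy column of its own.
model.matrix(~ f)      # columns: (Intercept), fb, fc

# With the intercept dropped, every level gets its own column.
model.matrix(~ f - 1)  # columns: fa, fb, fc

# Adding an intercept on top of the full set of dummies is redundant:
# fa + fb + fc equals the intercept column, so the combined matrix
# has rank 3, not 4.
qr(cbind(1, model.matrix(~ f - 1)))$rank  # 3
```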

To illustrate this, let's consider a toy example:

> data <- data.frame(y=c(2, 3, 5, 7, 11, 25), f=as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))
> summary(lm(y ~ f, data))

Call:
lm(formula = y ~ f, data = data)

Residuals:
   1    2    3    4    5    6 
-0.5  0.5 -1.0  1.0 -7.0  7.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    2.500      4.093   0.611   0.5845  
fb             3.500      5.788   0.605   0.5880  
fc            15.500      5.788   2.678   0.0752 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 5.788 on 3 degrees of freedom
Multiple R-squared: 0.7245, Adjusted R-squared: 0.5409 
F-statistic: 3.945 on 2 and 3 DF,  p-value: 0.1446 

As you can see, there are three coefficients (the same as the number of levels in the factor). Here, a has been chosen as the baseline, so (Intercept) refers to the subset of data where f is a. The coefficients for b and c (fb and fc) are the differences between the baseline intercept and the intercepts for the two other factor levels. Thus the intercept for b is 6 (2.500 + 3.500) and the intercept for c is 18 (2.500 + 15.500).
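You can check this arithmetic directly: with treatment coding, each group's fitted value is its group mean, recovered by adding the level's coefficient to the intercept. A quick sketch on the same toy data:

```r
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))
fit <- lm(y ~ f, data)

# Per-level means computed directly from the data...
tapply(data$y, data$f, mean)                        # a = 2.5, b = 6, c = 18

# ...match intercept-plus-coefficient from the model.
unname(coef(fit)["(Intercept)"])                    # 2.5 (baseline 'a')
unname(coef(fit)["(Intercept)"] + coef(fit)["fb"])  # 6
unname(coef(fit)["(Intercept)"] + coef(fit)["fc"])  # 18
```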

If you don't like the automatic choice, you could pick another level as the baseline: How to force R to use a specified factor level as reference in a regression?
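For instance, `relevel` switches the reference level (a sketch on the same toy data; in your case you would pick whichever year or month you want as the baseline):

```r
data <- data.frame(y = c(2, 3, 5, 7, 11, 25),
                   f = as.factor(c('a', 'a', 'b', 'b', 'c', 'c')))

# Make 'b' the baseline instead of the default 'a'.
data$f <- relevel(data$f, ref = "b")
fit <- lm(y ~ f, data)
coef(fit)
# Now the intercept is the mean for 'b' (6), and the reported
# coefficients are fa and fc, measured relative to 'b'.
```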

NPE
  • Thanks a lot! I would never have come up with that myself. I guess that is just a property of dummy variables or what? – Kasper Christensen Mar 05 '13 at 19:12
  • 1
    @KasperChristensen: I hope the latest edit sheds some more light. – NPE Mar 05 '13 at 19:20
  • 1
    @KasperChristensen: it's a property of matrix algebra. If X doesn't have full column rank (i.e. X has linear dependence, perfect multicollinearity, etc.), X'X is not invertible and you can't estimate the coefficients. – Joshua Ulrich Mar 05 '13 at 19:33
  • 1
    @JoshuaUlrich X'X isn't invertible but you can still get estimates using a generalized inverse. It's just that the estimates aren't unique and you can't really interpret most of the coefficients directly. – Dason Mar 05 '13 at 20:20