I was trying out linear regression in R with a categorical attribute and noticed that I don't get a coefficient for every factor level.

Please see my code below. I have 5 factor levels for states, but I see only 4 coefficients for them.

> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
  states population
1     WA   0.5
2     TE   0.2
3     GE   0.6
4     LA   0.7
5     SF   0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)

Call:
lm(formula = population ~ states, data = df)

Coefficients:
(Intercept)     statesLA     statesSF     statesTE     statesWA  
        0.6          0.1          0.3         -0.4         -0.1

I also tried a larger data set, built as follows, but still see the same behavior:

for(i in 1:10)
{
    df = rbind(df,df)
}

EDIT: Thanks to the responses from eipi10, MrFlick and economy, I now understand that one of the levels is being used as the reference level. But when I get new test data whose state is "GE", what do I substitute into the equation y=m1x1+m2x2+...+c?

I also tried flattening the data so that each factor level gets its own column, but again I get NA as the coefficient for one of the columns. If I have new test data whose state is 'WA', how can I get the population value? What do I substitute as its coefficient?

> df1
  population GE MI TE WA
1          1  0  0  0  1
2          2  1  0  0  0
3          2  0  0  1  0
4          1  0  1  0  0

lm(formula = population ~ (GE+MI+TE+WA),data=df1)

Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)

Coefficients:
(Intercept)           GE           MI           TE           WA  
          1            1            0            1           NA  
tubby
  • The coefficient for `states="GE"` is the intercept. In a model with an intercept, one of the factor levels has to be the "reference" level. All of the other coefficients for `states` are relative to `"GE"`. – eipi10 May 11 '15 at 21:31
  • If you want a different level to be the reference level you can use `relevel`: `df$states = relevel(df$states, ref = "LA")`. – eipi10 May 11 '15 at 21:36
  • You could also fit an intercept free model: `lm(formula = population ~ states-1, data = df)` – MrFlick May 11 '15 at 21:45
  • @MrFlick, thanks. But when I get new test data whose state is "GE", what do I substitute into the equation y=m1x1+m2x2+...+c? Please see the edit to my question above. – tubby May 11 '15 at 22:39
  • @eipi10, thanks, but I'm still confused about what to substitute as the coefficient for an observation whose state is the one that is not given. Please see my edit above. – tubby May 11 '15 at 22:40
  • If you fit the intercept-free model, you see that all the states get a coefficient. Or if you use the standard reference-level coding you would still get a value for GE in the intercept. If you want to predict new values, both methods would work fine with `predict()`. Your "solution" of creating indicator variables for all states is invalid because your model is overspecified and therefore inestimable. This is a basic feature of regression with categorical variables. You might want to pick up a basic statistics textbook to learn more. This is not a programming question anymore. – MrFlick May 12 '15 at 00:53
  • The intercept is the coefficient for the reference category. So, for your first regression, when state = GE (the reference category), population = 0.6 (the coefficient for the intercept). Every other prediction for population is equal to 0.6 + state, where "state" is the coefficient for a given state. For example, the predicted population of LA is 0.6 + 0.1 = 0.7. In the case of this regression, the predictions are exact, because your data has only a single observation for each state. – eipi10 May 12 '15 at 01:04
  • My [answer to this question](http://stackoverflow.com/questions/26783529/dummy-variables-for-logistic-regression-in-r) might be helpful for illustration. It deals with logistic regression, but the issue is the same. – eipi10 May 12 '15 at 01:04

1 Answer

GE comes first alphabetically, so R uses it as the reference level and absorbs it into the intercept term. As eipi10 stated, you can interpret the coefficients for the other levels of states relative to GE as the baseline (statesLA = 0.1 means LA is, on average, 0.1 higher than GE).
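A minimal sketch of how the reference level comes about, assuming the same five-row data frame as in the question (the names `fit` and `df_la` are only illustrative). It shows the factor levels, the 0/1 columns R builds from them, and what changes when a different reference level is chosen with `relevel`, as suggested in the comments:

```r
# Rebuild the data frame from the question, with states as a factor.
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# Levels are sorted alphabetically, so "GE" comes first and is the reference.
levels(df$states)

# The design matrix: an intercept column plus one 0/1 column
# for each non-reference level. The GE row is 1 0 0 0 0.
model.matrix(population ~ states, data = df)

fit <- lm(population ~ states, data = df)
coef(fit)

# Choosing another reference level only changes what the intercept stands for.
df_la <- df
df_la$states <- relevel(df_la$states, ref = "LA")
coef(lm(population ~ states, data = df_la))
```

With LA as the reference, the intercept is LA's value and the other coefficients become differences from LA; the fitted values themselves do not change.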

EDIT:

To respond to your updated question:

If you include all of the levels in a linear regression that also has an intercept, you get a situation called perfect collinearity: the indicator columns for the levels add up to exactly the intercept column, so the coefficients are not uniquely determined. That is why you see an NA when you force each category into its own variable; R drops one of the redundant columns. I won't go into the full explanation here (any reference on multicollinearity or the "dummy variable trap" covers it). If you want to see a coefficient for every level, you can fit the regression without an intercept term, as suggested in the comments, but this is ill-advised unless you have a specific reason to do it.
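The collinearity can be checked directly. The following is a sketch under the assumption that the data are the five rows from the question; `dummies`, `full` and `fit0` are names made up for this illustration:

```r
# Same data as in the question.
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# One 0/1 column per level and no intercept: 5 columns.
dummies <- model.matrix(~ states - 1, data = df)

# Each row has exactly one 1, so the dummy columns sum to a column of ones,
# which is exactly what the intercept column is.
all(rowSums(dummies) == 1)

# Intercept plus all five dummies: 6 columns, but only 5 are independent.
full <- cbind(Intercept = 1, dummies)
ncol(full)
qr(full)$rank

# Intercept-free fit: every level gets its own coefficient.
fit0 <- lm(population ~ states - 1, data = df)
coef(fit0)
```

Because the rank is one less than the number of columns, `lm` cannot estimate all six coefficients and reports NA for one of them. In the intercept-free fit each coefficient is simply that state's own population value, since there is one observation per state.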

As for the interpretation of GE in your y=mx+c equation, you can calculate the expected y by knowing that the levels of the other states are binary (zero or one), and if the state is GE, they will all be zero.

e.g.

y = x1b1 + x2b2 + x3b3 + c
y = b1(0) + b2(0) + b3(0) + c
y = c

If you don't have any other variables, as in your first example, the predicted value for GE is simply the intercept term (0.6).
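In practice you rarely substitute by hand; `predict()` does it for you, as MrFlick notes in the comments. A short sketch, assuming the model from the first example (`new_obs` and `b` are illustrative names), comparing `predict()` with the manual calculation:

```r
# Same data and model as in the question's first example.
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)
fit <- lm(population ~ states, data = df)

# New observations; the factor uses the same levels as the training data.
new_obs <- data.frame(states = factor(c("GE", "WA"), levels = levels(df$states)))
predict(fit, newdata = new_obs)

# The same by hand.
b <- coef(fit)
# GE: every dummy is 0, so only the intercept remains.
unname(b["(Intercept)"])
# WA: intercept plus the statesWA coefficient.
unname(b["(Intercept)"] + b["statesWA"])
```

For this data both routes should give 0.6 for GE and 0.6 + (-0.1) = 0.5 for WA, matching the coefficients printed in the question.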

economy
  • I'm not sure if the last sentence is accurate. I don't think the baseline is identically equal to the intercept. There was another solution I've seen for when you have to know the baseline and I'm searching for it now. I'll update this comment if I find it. – Hack-R Mar 03 '17 at 18:19
  • Another Q & A on Stack Overflow for this issue is: [`lm` summary not display all factor levels](https://stackoverflow.com/q/41032858/4891738). Both Q & A serve as duplicate targets for questions on this FAQ. – Zheyuan Li Jul 15 '22 at 18:08