I am running the summary(lm(...)) function in R. When I print the coefficients, I get estimates for all variables except the last one, which shows "NA".

I tried switching the last column of data with another column and again, whatever was in the last column got "NA", but everything else got estimates.

A little bit about the data: I have about 5 variables with data in every row, plus 12 seasonal dummy variables. For example, the January variable is 1 for every day in January and 0 otherwise, the February variable is 1 for every day in February and 0 otherwise, and so on. Does anyone know what would produce "NA" for the last coefficient estimate? The first time I ran it, it was the coefficient for the December dummy variable. Is it because of my monthly dummy variables? Thanks

This is my reproducible example.

dat <- data.frame(
         one = c(sample(1000:1239)),
         two = c(sample(200:439)),
         three = c(sample(600:839)),
         Jan = c(rep(1,20), rep(0,220)),
         Feb = c(rep(0,20), rep(1,20), rep(0,200)),
         Mar = c(rep(0,40), rep(1,20), rep(0,180)),
         Apr = c(rep(0,60), rep(1,20), rep(0,160)),
         May = c(rep(0,80), rep(1,20), rep(0,140)),
         Jun = c(rep(0,100), rep(1,20), rep(0,120)),
         Jul = c(rep(0,120), rep(1,20), rep(0,100)),
         Aug = c(rep(0,140), rep(1,20), rep(0,80)),
         Sep = c(rep(0,160), rep(1,20), rep(0,60)),
         Oct = c(rep(0,180), rep(1,20), rep(0,40)),
         Nov = c(rep(0,200), rep(1,20), rep(0,20)),
         Dec = c(rep(0,220), rep(1,20))
      )

attach(dat)

summary(lm(one ~ two + three + Jan + Feb + 
          Mar + Apr + May + Jun + Jul + Aug + Sep + Oct + Nov + Dec))
J M
  • Let's start with getting a reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Brandon Bertelsen Sep 07 '11 at 17:14
  • The number of dummy variables is always one less than the number of levels of the factor. So in your case, with 12 months, you should define 11 dummies. You are probably defining 12, which is why the last one is not estimated. – Ramnath Sep 07 '11 at 17:28
  • That is right, Ramnath. I am using 12. Why do we use 1 less? I am not clear on how that would work, as my data is for every day of the year. So there will be one month that simply doesn't get a dummy? – J M Sep 07 '11 at 17:39
  • If you keep `month` as a factor and use a formula containing `month-1` (e.g. `y~month-1`), R will set up the dummy variables for you and suppress the intercept ... if you provide a reproducible example I (or someone else) will show you how it works – Ben Bolker Sep 07 '11 at 17:41
  • Ok added a reproducible example. – J M Sep 07 '11 at 20:18
  • @BenBolker I did not follow your comment. I tested the example I added to my post, and it does exactly the same thing as my real problem. I really need that last coefficient. – J M Sep 07 '11 at 21:41

2 Answers

You have to think a bit more about how your model is defined.

Here's your approach (edited for readability):

> set.seed(101)
> dat<-data.frame(one=c(sample(1000:1239)),
                 two=c(sample(200:439)),
                 three=c(sample(600:839)),
                 Jan=c(rep(1,20),rep(0,220)),
                 Feb=c(rep(0,20),rep(1,20),rep(0,200)),
                 Mar=c(rep(0,40),rep(1,20),rep(0,180)),
                 Apr=c(rep(0,60),rep(1,20),rep(0,160)),
                 May=c(rep(0,80),rep(1,20),rep(0,140)),
                 Jun=c(rep(0,100),rep(1,20),rep(0,120)),
                 Jul=c(rep(0,120),rep(1,20),rep(0,100)),
                 Aug=c(rep(0,140),rep(1,20),rep(0,80)),
                 Sep=c(rep(0,160),rep(1,20),rep(0,60)),
                 Oct=c(rep(0,180),rep(1,20),rep(0,40)),
                 Nov=c(rep(0,200),rep(1,20),rep(0,20)),
                 Dec=c(rep(0,220),rep(1,20)))
> summary(lm(one ~ two + three + Jan + Feb + Mar + Apr + 
         May + Jun + Jul + Aug + Sep + Oct + Nov + Dec,
            data=dat))

And the output:

[snip]
Coefficients: (1 not defined because of singularities)

Note this line: it indicates that R (like any other statistical package you choose to use) can't estimate all the parameters, because the predictor variables are not all linearly independent.

              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1149.55556   53.52499  21.477   <2e-16 ***

The intercept here represents the predicted value when all predictor variables are zero. In any particular case the interpretation of the intercept depends on how you have parameterized your model. The dummy variables you have defined for month are not all linearly independent of the intercept column: they sum to 1 in every row. lm is smart enough to detect this and drop some of the unidentifiable (linearly dependent) predictor variables. The details of which particular predictor(s) are discarded are technical (you would have to look inside the lm.fit function, but you probably don't want to do this); in practice, lm works through the columns in formula order and drops a column that is linearly dependent on the ones before it, which is why swapping columns moves the NA to whichever variable comes last. In this case, R throws away the December predictor. Therefore, if we set all the predictors (two, three, and all month dummies Jan-Nov) to zero, we end up with the expected value when two=0 and three=0 and when the month is not equal to any of Jan-Nov -- i.e., the expected value for December.

two           -0.09670    0.06621  -1.460   0.1455    
three          0.02446    0.06666   0.367   0.7141    
Jan          -19.49744   22.17404  -0.879   0.3802    
Feb          -28.22652   22.27438  -1.267   0.2064    
Mar           -6.05246   22.25468  -0.272   0.7859    
Apr           -5.60192   22.41204  -0.250   0.8029    
May          -13.19127   22.34289  -0.590   0.5555    
Jun          -19.69547   22.14274  -0.889   0.3747    
Jul          -44.45511   22.20837  -2.002   0.0465 *  
Aug           -2.08404   22.26202  -0.094   0.9255    
Sep          -10.13351   22.10252  -0.458   0.6470    
Oct          -31.80482   22.33335  -1.424   0.1558    
Nov          -20.35348   22.09953  -0.921   0.3580    
Dec                 NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared: 0.04381,    Adjusted R-squared: -0.01119 
F-statistic: 0.7966 on 13 and 226 DF,  p-value: 0.6635 
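
The linear dependence can be checked directly. This is a minimal sketch, assuming data shaped like `dat` above (rebuilt here in a loop so the block runs on its own; the names `X` and `fit` are chosen for illustration). The twelve dummies sum to 1 in every row, which is exactly the intercept column, so the design matrix has 15 columns but only rank 14:

```r
set.seed(101)
# Rebuild the example data: three numeric columns plus 12 monthly 0/1 dummies
dat <- data.frame(one = sample(1000:1239),
                  two = sample(200:439),
                  three = sample(600:839))
for (i in seq_along(month.abb)) {
  dat[[month.abb[i]]] <- as.numeric(rep(seq_along(month.abb), each = 20) == i)
}

# Every row belongs to exactly one month, so the dummies add up to
# the intercept column (a column of ones)
all(rowSums(dat[, month.abb]) == 1)

# Design matrix: intercept + two + three + 12 dummies = 15 columns
X <- model.matrix(one ~ ., data = dat)
ncol(X)
qr(X)$rank   # one less than ncol(X): one coefficient cannot be estimated

# alias() reports which term is a linear combination of the others
fit <- lm(one ~ ., data = dat)
alias(fit)$Complete
```

Because R's random number generator has changed since this answer was written, the simulated values (and hence the estimates) may differ from the output shown here, but the rank deficiency does not depend on them.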

Now do it again, this time setting up a model formula that uses -1 to discard the intercept term (we reset the random seed for reproducibility):

> set.seed(101)
> dat1 <- data.frame(one=c(sample(1000:1239)),
                     two=c(sample(200:439)),
                     three=c(sample(600:839)),
                     month=factor(rep(month.abb,each=20),levels=month.abb))
> summary(lm(one ~ two + three + month-1, data=dat1))

    Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
two        -0.09670    0.06621  -1.460    0.146    
three       0.02446    0.06666   0.367    0.714    

The estimates for two and three are the same as before.

monthJan 1130.05812   52.79625  21.404   <2e-16 ***
monthFeb 1121.32904   55.18864  20.318   <2e-16 ***
monthMar 1143.50310   53.59603  21.336   <2e-16 ***
monthApr 1143.95365   54.99724  20.800   <2e-16 ***
monthMay 1136.36429   53.38218  21.287   <2e-16 ***
monthJun 1129.86010   53.85865  20.978   <2e-16 ***
monthJul 1105.10045   54.94940  20.111   <2e-16 ***
monthAug 1147.47152   54.57201  21.027   <2e-16 ***
monthSep 1139.42205   53.58611  21.263   <2e-16 ***
monthOct 1117.75075   55.35703  20.192   <2e-16 ***
monthNov 1129.20208   53.54934  21.087   <2e-16 ***
monthDec 1149.55556   53.52499  21.477   <2e-16 ***

The estimate for December is the same as the intercept estimate above. Each other month's estimate equals the intercept plus that month's estimate from the first fit (for example, January: 1149.55556 - 19.49744 = 1130.05812). The p values are different, because their meaning has changed. Previously, they tested the difference of each month from December; now they test whether each month's expected value (at two=0 and three=0) differs from a baseline value of zero.
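
The relationship between the two parameterizations can be verified numerically. This is a sketch, assuming `dat1` as defined above (rebuilt here so it runs on its own); `month_ref`, `fit_int`, and `fit_noint` are names chosen for illustration. December is made the reference level so that the intercept plays the same role it had in the fit with hand-built dummies:

```r
set.seed(101)
dat1 <- data.frame(one = sample(1000:1239),
                   two = sample(200:439),
                   three = sample(600:839),
                   month = factor(rep(month.abb, each = 20), levels = month.abb))

# Copy of month with December as the reference (baseline) level
dat1$month_ref <- relevel(dat1$month, ref = "Dec")

fit_int   <- lm(one ~ two + three + month_ref, data = dat1)  # intercept + 11 offsets
fit_noint <- lm(one ~ two + three + month - 1, data = dat1)  # 12 month-level estimates

b_int   <- coef(fit_int)
b_noint <- coef(fit_noint)

# December's estimate without an intercept is the intercept of the first model
all.equal(unname(b_noint["monthDec"]), unname(b_int["(Intercept)"]))

# Every other month is the intercept plus that month's offset from December
all.equal(unname(b_noint["monthJan"]),
          unname(b_int["(Intercept)"] + b_int["month_refJan"]))
```

Both fits span the same column space, so fitted values and residuals are identical; only the labelling of the coefficients changes.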

Ben Bolker
  • OK, thank you for the detailed explanation. Could you rephrase this sentence, as I am struggling with it: "You have to think about what that means in any particular case, but because R decides to throw away the December predictor (it has to throw something away), that corresponds to December when two=three=0." – J M Sep 08 '11 at 05:32
  • I've tried again. If this doesn't do it you may need to read further. There is a fairly technical post at http://rip94550.wordpress.com/2011/01/17/regression-1-linear-dependence-or-exact-multicollinearity/ ; you should look for keywords "rank" and "multicollinearity" (and "linear (in)dependence") – Ben Bolker Sep 08 '11 at 12:53

You are getting NA for the last variable because it is linearly dependent on the other 11 dummy variables together with the intercept. R's lm function (and all properly constructed R regression functions as well) will automatically exclude linearly dependent variables for you. For hand-built dummy columns this happens in the QR decomposition inside lm.fit; for factors, model.matrix avoids the problem up front by generating one fewer dummy column than there are levels. If all of the other month variables are 0, then December will be 1. This is related to the exclusion of the first level of a factor, but not exactly the same.

There are probably better ways to do this, such as storing the month as a single factor and letting R build the dummy variables.

As for where the information for December went: it's in the '(Intercept)' term. If you want to have all of the levels labeled as you expect them, try adding either -1 or +0 to the formula and you will see December emerge magically from the mists.
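
A minimal sketch of that suggestion with hand-built dummy columns, assuming data shaped like the question's example (`dat` is rebuilt here so the block runs on its own; `with_int` and `no_int` are names chosen for illustration). Removing the intercept leaves 14 linearly independent columns, so December gets an estimate:

```r
set.seed(101)
dat <- data.frame(one = sample(1000:1239),
                  two = sample(200:439),
                  three = sample(600:839))
# 12 monthly 0/1 dummy columns, 20 rows per month
for (i in seq_along(month.abb)) {
  dat[[month.abb[i]]] <- as.numeric(rep(seq_along(month.abb), each = 20) == i)
}

with_int <- lm(one ~ ., data = dat)       # intercept kept: Dec is aliased
no_int   <- lm(one ~ 0 + ., data = dat)   # intercept dropped: all 12 months estimated

names(which(is.na(coef(with_int))))       # the aliased term, "Dec"
anyNA(coef(no_int))                       # FALSE: nothing is dropped

# The December coefficient without an intercept equals the intercept with one
all.equal(unname(coef(no_int)["Dec"]), unname(coef(with_int)["(Intercept)"]))
```

Note that the R-squared reported by summary() is computed differently for models without an intercept, so it is not comparable between the two fits.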

IRTFM
  • So all I am interested in is getting the coefficients. Why does R exclude the last coefficient for me? It must be the case that I don't need it, but I do not understand why. – J M Sep 07 '11 at 20:52
  • Because it is linearly dependent on the previous 11 terms. If you know that an observation is *not* in January-November then saying that it's in December gives you no additional information ... I think you have to forgive people for trying to lecture you on the statistical underpinnings of the models, because it's the best way for you to understand (and answer your own questions in the future) – Ben Bolker Sep 07 '11 at 22:42
  • @BenBolker what if there are two or more types of binary independent variables? There would be multiple NA values from lm.fit() and the intercept value would be the sum of the NA values excluded. For example, the month (12 binary variables) and year (3 binary variables representing 3 years). – Scott Davis May 27 '15 at 18:07
  • Binary or multinomial variables are all coded as factors and have "treatment contrasts", which by default results in the "(Intercept)" coefficient representing the effect (the group mean, for an identity link) at the intersection of all such variables at their lowest level. If you had enough data you could still create a model with interactions and no intercept (using the methods described above), which would summarize (as the mean) all of the two-way or higher combinations. – IRTFM May 27 '15 at 18:21
  • Thank you @BondedDust. I just noticed BenBolker's example was for factored categorical variables. If month was broken down into 12 binary variables (Jan: 0 = No, 1 = Yes; Feb: 0 = No, 1 = Yes; etc.), how can I fix the multiple NAs? If I had the same dataset as the OP, I would get the NA value for Dec. – Scott Davis May 27 '15 at 19:44
  • This is not adequately described to answer and I doubt can be answered in comments anyway. Probably need to post a new question with full output and don't forget to start with full description of data. – IRTFM May 27 '15 at 20:31