When I include factor1, factor2, and their interaction, the interaction term uses the combination of each factor's base level as its reference. However, if I include only the interaction term (factor1:factor2 instead of factor1*factor2), the combination of the last levels of both factors is used as the reference (i.e. that row has "NA" for the estimate, std. error, etc.). I have checked multiple times that each factor has the right base level set before building the model. Is there a way to make the combination of each factor's first level the reference? Thanks!

hahahut
  • Please add a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample data and the code you are running so we can more clearly see the problem and test possible solutions. – MrFlick Aug 12 '16 at 18:39

1 Answer

Let's look at what's going on here.

(dd <- expand.grid(f1=letters[1:2],f2=LETTERS[1:2]))
##   f1 f2
## 1  a  A
## 2  b  A
## 3  a  B
## 4  b  B

Add a response variable:

dd2 <- data.frame(dd,y=c(1,2,3,5))

Use model.matrix() to look at what dummy variables get constructed.

data.frame(dd,model.matrix(~f1*f2,data=dd),check.names=FALSE)
##   f1 f2 (Intercept) f1b f2B f1b:f2B
## 1  a  A           1   0   0       0
## 2  b  A           1   1   0       0
## 3  a  B           1   0   1       0
## 4  b  B           1   1   1       1

So the baseline (intercept) is the a:A combination; the f1b parameter is the difference between b and a when f2==A; the f2B parameter is the difference between B and A when f1==a; and the interaction f1b:f2B is the difference between the observed b:B value and what the additive model (intercept plus the two main effects) would predict for it.
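As a quick check of that reading, here is a minimal sketch that fits the full model to the toy data. The data frame is rebuilt here so the snippet runs on its own, and the name fit_full is just a label chosen for this illustration.

```r
# Rebuild the toy data: cell values aA = 1, bA = 2, aB = 3, bB = 5
dd2 <- data.frame(
  f1 = factor(c("a", "b", "a", "b")),
  f2 = factor(c("A", "A", "B", "B")),
  y  = c(1, 2, 3, 5)
)

# Main effects plus interaction, default treatment contrasts
fit_full <- lm(y ~ f1 * f2, data = dd2)
coef(fit_full)
# (Intercept) = 1  (the aA cell)
# f1b         = 1  (bA - aA)
# f2B         = 2  (aB - aA)
# f1b:f2B     = 1  (bB minus the additive prediction 1 + 1 + 2)
```

The model is saturated (four parameters, four observations), so the coefficients reproduce the cell values exactly, up to floating-point error.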

If we specify only the interaction (f1:f2), R keeps the intercept column and also builds an indicator column for every combination of levels, so the model matrix is overparameterized: the four interaction columns sum to the intercept column. In this matrix there isn't really a "baseline" level, but when the matrix is rank-deficient, lm() drops the last aliased column by default (its coefficient is reported as NA), so you effectively end up with b:B as your baseline (the b:B row of the matrix is [1 0 0 0] once the last column is dropped).

data.frame(dd,X3 <- model.matrix(~f1:f2,data=dd),check.names=FALSE)
##   f1 f2 (Intercept) f1a:f2A f1b:f2A f1a:f2B f1b:f2B
## 1  a  A           1       1       0       0       0
## 2  b  A           1       0       1       0       0
## 3  a  B           1       0       0       1       0
## 4  b  B           1       0       0       0       1
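To see the NA from the question appear, here is a small sketch that fits the interaction-only formula to the same toy data (rebuilt so the snippet is self-contained; fit_int is just a name chosen here).

```r
# Toy data again: cell values aA = 1, bA = 2, aB = 3, bB = 5
dd2 <- data.frame(
  f1 = factor(c("a", "b", "a", "b")),
  f2 = factor(c("A", "A", "B", "B")),
  y  = c(1, 2, 3, 5)
)

# Interaction only: intercept + 4 indicator columns, rank 4
fit_int <- lm(y ~ f1:f2, data = dd2)
coef(fit_int)
# (Intercept) =  5  (the bB cell, which has become the baseline)
# f1a:f2A     = -4  (aA - bB)
# f1b:f2A     = -3  (bA - bB)
# f1a:f2B     = -2  (aB - bB)
# f1b:f2B     = NA  (aliased column, dropped by lm)
```

Every remaining coefficient is a difference from the b:B cell, which is why the last-level combination looks like the reference.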

If you want to use a specific model matrix, you can cheat and build it directly. You have to remember that if you don't specify -1 in the formula, R will automatically re-add an intercept column, so here we drop the first two columns, i.e. the intercept and f1a:f2A (y~. says "use all the variables in the data frame, except the response variable, as predictors").

dd3 <- data.frame(y=dd2$y,X3[,-(1:2)])
coef(lm(y~.,data=dd3))
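For reference, here is a self-contained version of that cheat with the values it returns on the toy data. The names fit_cheat and the rebuilt dd2 are only for this sketch; note that data.frame() rewrites the ":" in the column names to ".", so the coefficients are labelled f1b.f2A and so on.

```r
# Toy data: cell values aA = 1, bA = 2, aB = 3, bB = 5
dd2 <- data.frame(
  f1 = factor(c("a", "b", "a", "b")),
  f2 = factor(c("A", "A", "B", "B")),
  y  = c(1, 2, 3, 5)
)

# Interaction-only model matrix: intercept plus four indicator columns
X3 <- model.matrix(~ f1:f2, data = dd2)

# Drop the intercept and the f1a:f2A column; lm() re-adds the intercept
dd3 <- data.frame(y = dd2$y, X3[, -(1:2)])
fit_cheat <- lm(y ~ ., data = dd3)
coef(fit_cheat)
# (Intercept) = 1  (the aA cell)
# f1b.f2A     = 1  (bA - aA)
# f1a.f2B     = 2  (aB - aA)
# f1b.f2B     = 4  (bB - aA)
```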

Looking at the model matrix above but leaving out the second column, we interpret this as:

  • (Intercept) ([1 0 0 0]) is the value of the a:A cell
  • f1b:f2A ([1 1 0 0]) is the difference between b and a when f2==A
  • f1a:f2B ([1 0 1 0]) is the difference between B and A when f1==a
  • f1b:f2B ([1 0 0 1]) is now the straight difference between the b:B and a:A cells, uncorrected for the additive effects. Is that really what you wanted?
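If a plain "every cell versus a:A" parameterization is what you are after, one alternative that avoids editing the model matrix by hand is to combine the two factors into a single factor with interaction(). This is not part of the approach above, just a sketch under the assumption that treatment contrasts on the combined factor are acceptable; f12 and fit_cells are names invented for the example.

```r
# Toy data: cell values aA = 1, bA = 2, aB = 3, bB = 5
dd2 <- data.frame(
  f1 = factor(c("a", "b", "a", "b")),
  f2 = factor(c("A", "A", "B", "B")),
  y  = c(1, 2, 3, 5)
)

# One factor with a level per cell; the first level (a.A) becomes the baseline
dd2$f12 <- interaction(dd2$f1, dd2$f2)
levels(dd2$f12)
# "a.A" "b.A" "a.B" "b.B"

fit_cells <- lm(y ~ f12, data = dd2)
coef(fit_cells)
# (Intercept) = 1, f12b.A = 1, f12a.B = 2, f12b.B = 4
```

The estimates match the hand-built version: each coefficient is a raw difference from the a:A cell. If a different cell should be the reference, relevel() on the combined factor changes it.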
Ben Bolker