0

Consider the following table :

DB <- data.frame(
  Y =rnorm(6),
  X1=c(T, T, F, T, F, F),
  X2=c(T, F, T, F, T, T)
)
           Y    X1    X2
1  1.8376852  TRUE  TRUE
2 -2.1173739  TRUE FALSE
3  1.3054450 FALSE  TRUE
4 -0.3476706  TRUE FALSE
5  1.3219099 FALSE  TRUE
6  0.6781750 FALSE  TRUE

I'd like to explain my quantitative variable Y by two binary variables (TRUE or FALSE) without intercept.

The argument of this choice is that, in my study, we can't observe X1=FALSE and X2=FALSE at the same time, so it doesn't make sense to have a mean, other than 0, for this level.

With intercept

m1 <- lm(Y~X1+X2, data=DB)
summary(m1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -1.9684     1.0590  -1.859   0.1600  
X1TRUE        0.7358     0.9032   0.815   0.4749  
X2TRUE        3.0702     0.9579   3.205   0.0491 *

Without intercept

m0 <- lm(Y~0+X1+X2, data=DB)
summary(m0)

Coefficients:
        Estimate Std. Error t value Pr(>|t|)  
X1FALSE  -1.9684     1.0590  -1.859   0.1600  
X1TRUE   -1.2325     0.5531  -2.229   0.1122  
X2TRUE    3.0702     0.9579   3.205   0.0491 *

I can't explain why two coefficients are estimated for the variable X1. It seems to be equivalent to the intercept coefficient in the model with intercept.

Same results

When we display the estimation for all the combinations of variables, the two models are the same.

DisplayLevel <- function(m){
  R <-  outer(
    unique(DB$X1),
    unique(DB$X2),
    function(a, b) predict(m,data.frame(X1=a, X2=b))
  )
  colnames(R) <- paste0('X2:', unique(DB$X2))
  rownames(R) <- paste0('X1:', unique(DB$X1))
  return(R)
}

DisplayLevel(m1)
          X2:TRUE  X2:FALSE
X1:TRUE  1.837685 -1.232522
X1:FALSE 1.101843 -1.968364

DisplayLevel(m0)
          X2:TRUE  X2:FALSE
X1:TRUE  1.837685 -1.232522
X1:FALSE 1.101843 -1.968364

So the two models are equivalent.

Question

My question is : can we just estimate one coefficient for the first effect ? Can we force R to assign a 0 value to the combinations X1=FALSE and X2=FALSE ?

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
menubigmax_
  • 61
  • 1
  • 7

1 Answers1

0

Yes, we can, by

DB <- as.data.frame(data.matrix(DB))
## or you can do:
## DB$X1 <- as.integer(DB$X1)
## DB$X2 <- as.integer(DB$X2)

#            Y X1 X2
# 1 -0.5059575  1  1
# 2  1.3430388  1  0
# 3 -0.2145794  0  1
# 4 -0.1795565  1  0
# 5 -0.1001907  0  1
# 6  0.7126663  0  1

## a linear model without intercept
m0 <- lm(Y ~ 0 + X1 + X2, data = DB)

DisplayLevel(m0)
#             X2:1      X2:0
# X1:1  0.15967744 0.2489237
# X1:0 -0.08924625 0.0000000

I have explicitly coerced your TRUE/FALSE binary into numeric 1/0, so that no contrast is handled by lm().

The data appeared in my answer are different to yours, because you did not use set.seed(?) before rnorm() for reproducibility. But this is not a issue here.

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
  • Thanks you, your solution helps me a lot, but I've yet a misunderstanding. When we use directly categorical variables as factors, R builds a design matrix (`?model.matrix`) with 0-1 values. We don't have to recode. I don't understand why we have to use your clever trick, whereas R could automatically look after recoding. Is there an algebraic reason ? – menubigmax_ Jul 01 '16 at 12:47
  • Exactly and that is precisely what I don't understand : why R displays three columns ? Is it an R default ? – menubigmax_ Jul 01 '16 at 12:56
  • Ok it's that I thought. Thank you a lot. I'm a litte disappointed by R on this one... – menubigmax_ Jul 01 '16 at 13:01