
There is something I do not understand about model.matrix. When I enter a single binary variable without an intercept, it returns two columns, one per level.

> temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE))
> temp.data.table <- model.matrix( ~ 0 + x, data = temp.data)
> head(temp.data.table)
  xA xB
1  1  0
2  0  1
3  0  1
4  0  1
5  1  0
6  0  1

However, when I add a second binary variable, it creates only 3 columns instead of 4. Why is that? What makes the function suddenly behave differently, and how can I avoid it?

> temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE),
+                         'y' = sample(c('J', 'D'), 1000, replace = TRUE))
> temp.data.table <- model.matrix( ~ 0 + x + y, data = temp.data)
> head(temp.data.table)
  xA xB yJ
1  0  1  0
2  0  1  1
3  0  1  1
4  0  1  0
5  1  0  1
6  0  1  0
Kozolovska

2 Answers


You need to work with factors and set the contrasts to FALSE. Try this:

n <- 10
temp.data <- data.frame('x'=sample(c('A', 'B'), n, replace=TRUE),
                        'y'=factor(sample(c('J', 'D'), n, replace=TRUE)))
model.matrix( ~ 0 + x + y, data=temp.data,
              contrasts.arg=list(y=contrasts(temp.data$y, contrasts=FALSE)))

#    xA xB yD yJ
# 1   0  1  1  0
# 2   1  0  0  1
# 3   0  1  1  0
# 4   1  0  0  1
# 5   0  1  0  1
# 6   1  0  1  0
# 7   1  0  1  0
# 8   0  1  1  0
# 9   0  1  0  1
# 10  0  1  1  0
# attr(,"assign")
# [1] 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x
# [1] "contr.treatment"
# 
# attr(,"contrasts")$y
#   D J
# D 1 0
# J 0 1

To understand why this happens, try:

contrasts(temp.data$y)
#   J
# D 0
# J 1

contrasts(temp.data$y, contrasts=FALSE)
#   D J
# D 1 0
# J 0 1

For your x variable this happens automatically, because 0 + removes the intercept and the first factor in the formula then gets a column for every level. (Actually, x should also be coded as a factor.)

The reason is that in linear regression the levels of a factor are usually compared to a reference level (which you can change with relevel). In your model matrix, 0 + removes the intercept, so the first factor in the formula gets a column for every level, but the factors that follow are still coded against their reference level. Try model.matrix( ~ 0 + y + x, data=temp.data): you get only one x column but two y columns. This is determined by the default contrasts setting, which uses treatment contrasts.
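The effect of term order can be checked directly. The following is a minimal sketch on a small hand-made data frame (the name d and its values are made up for illustration, not taken from the question):

```r
# Small deterministic example data; both columns are explicit factors
d <- data.frame(x = factor(c("A", "B", "B", "A", "B", "A")),
                y = factor(c("J", "D", "J", "J", "D", "D")))

# x listed first: x gets a column per level, y is coded against its reference level D
m_xy <- model.matrix(~ 0 + x + y, data = d)
colnames(m_xy)
# [1] "xA" "xB" "yJ"

# y listed first: y gets a column per level, x is coded against its reference level A
m_yx <- model.matrix(~ 0 + y + x, data = d)
colnames(m_yx)
# [1] "yD" "yJ" "xB"
```

Only the first factor in a formula without an intercept is expanded to all of its levels; every later factor keeps the default treatment coding.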

You may want to read a relevant post by Rose Maier (2015), which explains this in great detail.

jay.sf
  • Why does it happen? I understand the idea of encoding a factor with k levels as k - 1 columns. But if so, why not do it for every factor? And if we remove the intercept, why does it not map each factor to as many columns as it has levels? I'm struggling with the logic behind it. – Kozolovska May 31 '20 at 13:18
  • Thanks, I am not sure that you'll have the answer, but why is it set to behave that certain way? (the difference between using one categorical variable without intercept to two categorical variables). – Kozolovska May 31 '20 at 13:30
  • @Kozolovska _Why_ is rather a philosophical question :) But that's how it is most commonly needed, thus most general. I've added some more explanation on the logic in my answer. – jay.sf May 31 '20 at 13:43
  • Your help is much appreciated. I get it now, more or less. I think this is somewhat strange behavior from model.matrix, as I think that if I do not want an intercept, it should imply that none of the factors should have one. – Kozolovska May 31 '20 at 13:50
  • I think the reason for this behavior is to ensure the design matrix is full rank (when the input variables are also linearly independent). Since `model.matrix` is intended to construct the design matrix of a linear model, this makes sense as default behavior. – jackkamm Jul 07 '21 at 18:02
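
The full-rank point in the last comment can be checked numerically. This is a rough sketch on a small hand-made data frame (d and its values are assumptions for illustration); it compares the default coding with full indicator coding for both factors:

```r
# Small deterministic example data; both columns are explicit factors
d <- data.frame(x = factor(c("A", "B", "B", "A", "B", "A")),
                y = factor(c("J", "D", "J", "J", "D", "D")))

# Default coding without an intercept: columns xA, xB, yJ
m_default <- model.matrix(~ 0 + x + y, data = d)

# Full indicator coding for both factors: columns xA, xB, yD, yJ
m_full <- model.matrix(~ 0 + x + y, data = d,
                       contrasts.arg = lapply(d, contrasts, contrasts = FALSE))

# xA + xB and yD + yJ both equal a column of ones, so one column of m_full is redundant
c(columns = ncol(m_default), rank = qr(m_default)$rank)  # 3 columns, rank 3
c(columns = ncol(m_full),    rank = qr(m_full)$rank)     # 4 columns, rank 3
```

With the full indicator matrix, a least-squares fit such as lm() cannot estimate all four coefficients uniquely, which is why the default coding drops one level of every factor after the first.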

You need to reset the contrasts of the factor variables. See this post.

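temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE),
                        'y' = sample(c('J', 'D'), 1000, replace = TRUE),
                        stringsAsFactors = TRUE)  # contrasts() only accepts factors

# Full indicator coding (no reference level) for every column
dat <- model.matrix(~ -1 + ., data = temp.data,
                    contrasts.arg = lapply(temp.data[, 1:2], contrasts, contrasts = FALSE))
head(dat)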

  xA xB yD yJ
1  0  1  0  1
2  1  0  0  1
3  1  0  0  1
4  1  0  0  1
5  0  1  1  0
6  0  1  0  1
Peter