
Or in other words: which algorithm is used in this case? I guess they use discriminant analysis, as described e.g. in chapter 4.4 of James et al., "An Introduction to Statistical Learning with Applications in R"?

After input from the comments, I could also restate the question as follows:

  • The first part of the magic happens in ans <- .External2(C_modelmatrix, t, data) (in model.matrix.default), where the model matrix changes according to the factor levels. I think I understand this part.
  • The second part still involves z <- .Call(C_Cdqrls, x, y, tol, FALSE), and I would not expect linear regression and discriminant analysis to be the same at the maths level. Am I missing something obvious? Again, my stats package is a binary; I do not have access to the source code... (see the short sketch right after this list)
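
A quick probe of what C_Cdqrls actually computes, as a minimal sketch: stats::.lm.fit is (as far as I can tell) the thin documented wrapper around the same C routine (see ?.lm.fit), so it can be called directly with a plain numeric matrix. It is an ordinary QR least-squares solver and knows nothing about factors:

set.seed(1)
X  <- cbind(1, rnorm(5))         # a plain numeric design matrix
yy <- rnorm(5)

# .lm.fit() forwards to C_Cdqrls (see ?.lm.fit):
.lm.fit(X, yy)$coefficients
# the same numbers (up to tolerance) as an ordinary QR least-squares solve:
qr.coef(qr(X), yy)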

I found quite useful explanations in this article, but at some point it only states:

... This [factor] deconstruction can be a complex task, so we will not go into details lest it take us too far afield...
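
For what it's worth, the details skipped there boil down to contrast coding: a factor with k levels is expanded into k - 1 dummy columns using its contrast matrix (treatment contrasts by default). A minimal sketch:

f <- factor(c(1, 2, 3, 2))
contrasts(f)        # the 3 x 2 treatment-contrast matrix (see ?contrasts)
model.matrix(~ f)   # intercept plus the dummy columns f2 and f3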

I could not find anything in the documentation, and I was not able to understand what is going on using debug(lm). Here is what I understand, using a reproducible example:

n <- 10
p <- 6
set.seed(1)
x <- seq(0, 20, length.out = n) + rnorm(n, 0, 1)
y <- c(1:3)
y <- sample(y, n, replace = TRUE)
z <- 10*y*x + 10*y + 10 + rnorm(n, 0, 1)
debug(lm)
fit <- lm(z ~ x*y)

After mt <- attr(mf, "terms") it looks like this:

mt
# ...
# attr(,"dataClasses")
#         z         x         y 
# "numeric" "numeric" "numeric" 

whereas after

n <- 10
p <- 6
set.seed(1)
x <- seq(0, 20, length.out = n) + rnorm(n, 0, 1)
y <- c(1:3)
y <- sample(y, n, replace = TRUE)
z <- 10*y*x + 10*y + 10 + rnorm(n, 0, 1)
y <- as.factor(y)
debug(lm)
fit <- lm(z ~ x*y)

and mt <- attr(mf, "terms") looks like this:

mt
# ...
# attr(,"dataClasses")
#         z         x         y 
# "numeric" "numeric"  "factor"

But then it seems they always call lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...), and in there z <- .Call(C_Cdqrls, x, y, tol, FALSE), which I thought only works without factors. The link above nicely explains everything down to the model matrix and the QR decomposition, which I thought does not work in the case of factors.
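
To check that the QR decomposition has no problem with factors (everything is numeric by the time it runs), the following sketch reproduces the coefficients of the factor example above with a plain QR least-squares fit on the model matrix:

# y is the factor version from the second example above
X   <- model.matrix(z ~ x * y)   # dummy-encoded, purely numeric
fit <- lm(z ~ x * y)

all.equal(coef(fit), lm.fit(X, z)$coefficients)  # TRUE (up to tolerance)
all.equal(coef(fit), qr.coef(qr(X), z))          # TRUE (up to tolerance)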

Edit: The model matrix after x <- model.matrix(mt, mf, contrasts) already differs. In the case of numerics:

x
   (Intercept)          x y       x:y
1            1 -0.6264538 3 -1.879361
2            1  2.4058655 1  2.405866
3            1  3.6088158 2  7.217632
4            1  8.2619475 1  8.261947
5            1  9.2183967 1  9.218397
6            1 10.2906427 2 20.581285
7            1 13.8207624 1 13.820762
8            1 16.2938803 2 32.587761
9            1 18.3535591 3 55.060677
10           1 19.6946116 2 39.389223
attr(,"assign")
[1] 0 1 2 3

In the case of factors:

x
   (Intercept)          x y2 y3      x:y2       x:y3
1            1 -0.6264538  0  1  0.000000 -0.6264538
2            1  2.4058655  0  0  0.000000  0.0000000
3            1  3.6088158  1  0  3.608816  0.0000000
4            1  8.2619475  0  0  0.000000  0.0000000
5            1  9.2183967  0  0  0.000000  0.0000000
6            1 10.2906427  1  0 10.290643  0.0000000
7            1 13.8207624  0  0  0.000000  0.0000000
8            1 16.2938803  1  0 16.293880  0.0000000
9            1 18.3535591  0  1  0.000000 18.3535591
10           1 19.6946116  1  0 19.694612  0.0000000
attr(,"assign")
[1] 0 1 2 2 3 3
attr(,"contrasts")
attr(,"contrasts")$`y`
[1] "contr.treatment"

Edit 2: Part of the question can also be found here.

  • One of the first steps within lm is constructing the design matrix, which involves dummy encoding of factors. The design matrix is of course numeric and is what is passed to lm.fit. – Roland Jul 09 '19 at 13:25
  • Perhaps see `?contrasts`? Or perhaps just look at `model.matrix(z ~ x * y)` using your example data? – Gregor Thomas Jul 09 '19 at 13:32
  • @Roland Good point, but see my edit. – Christoph Jul 09 '19 at 13:48
  • @Gregor I understand, the model matrix already differs, which makes sense. But I still can't proceed. See my edits. – Christoph Jul 09 '19 at 13:50
  • I don't understand. The model matrix obviously differs depending on whether you pass numeric variables or factor variables. You can find the source of R and the recommended packages on GitHub: https://github.com/wch/r-source – Roland Jul 09 '19 at 13:53
  • @Roland I can't find `modelmatrix` or `Cdqrls` at that [github](https://github.com/SurajGupta/r-source/tree/master/src/library/stats/src) location. Am I missing something? And I can't find it in `C:\R\R-3.5.1\library\stats` (with subfolders demo, help, html, libs, Meta, R, tests) – Christoph Jul 09 '19 at 14:09
  • [Here's the code in github](https://github.com/wch/r-source/blob/59fa5da9356244128d6d4e4ae20dad580b3f85d9/src/library/stats/R/models.R#L617), you can also view it by typing `model.matrix.default` in your R console. Tempted to close as a duplicate of the FAQ on [How to read the source code of an R function](https://stackoverflow.com/q/11173683/903061). – Gregor Thomas Jul 09 '19 at 14:18
  • @Gregor thanks, I will have a look at it. However, I don't think that this question is a duplicate, because I want to understand what R implemented. In your code link the work is done in line 669, the .Call... – Christoph Jul 09 '19 at 14:23
  • https://github.com/wch/r-source/blob/trunk/src/library/stats/src/model.c#L327 – Roland Jul 09 '19 at 14:51

1 Answer


With the help of the answers to this question, I realized that the answer is simple:

If the factors belong to the predictor variables, the model matrix just gets larger. Thus it is clear that C_Cdqrls can handle the model matrix, because by then it is purely numeric.

Only if the dependent variable(s) contain factors does linear regression with lm stop working properly, and discriminant analysis is one possibility. (At first glance it seems that stats::glm uses a logit model.)

From Wikipedia:

Discriminant function analysis is very similar to logistic regression, and both can be used to answer the same research questions. Logistic regression does not have as many assumptions and restrictions as discriminant analysis. However, when discriminant analysis’ assumptions are met, it is more powerful than logistic regression. Unlike logistic regression, discriminant analysis can be used with small sample sizes. It has been shown that when sample sizes are equal, and homogeneity of variance/covariance holds, discriminant analysis is more accurate. With all this being considered, logistic regression has become the common choice, since the assumptions of discriminant analysis are rarely met.


Example:

x <- seq(0, 10, length.out = 21)
y <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
y <- as.factor(y)
df <- data.frame(x = x, y = y)

# see the 'Warning' section in ?factor for converting a factor back to numeric:
plot(x, as.numeric(levels(y))[y], ylim = c(0, 1.2))

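# y is a factor here, so this lm fit does not work properly (see above):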
fit <- lm(y ~ x, data = df)
print(summary(fit))

fit_glm <- stats::glm(y ~ x, family = binomial(link = "logit"), data = df, control = list(maxit = 50))
print(summary(fit_glm))

df$glm.probs <- stats::predict(fit_glm, newdata = df, type = "response")
df$glm.pred <- ifelse(df$glm.probs > 0.5, 1, 0)
points(x, df$glm.pred + 0.05, col = "red")
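
Since discriminant analysis is mentioned as one possibility, here is a sketch of the same fit with MASS::lda, for comparison with the logit model (this assumes the MASS package is available; lda is not what lm or glm use internally):

library(MASS)  # for lda()

fit_lda <- lda(y ~ x, data = df)
df$lda.pred <- predict(fit_lda, newdata = df)$class
points(x, as.numeric(as.character(df$lda.pred)) + 0.1, col = "blue")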