The overall goal of the exercise is to find perfectly collinear columns in a very large but sparse matrix X in R, in the context of a linear regression problem. One approach that pops up from time to time is to make use of the pivoting QR decomposition underlying lm(): extract the OLS coefficients and drop all variables whose estimate comes back as NA. While base::qr() is much too slow to be a viable option, Matrix::qr() handles the sparse matrix well and is quite fast. Unfortunately, the results of the two implementations are quite different.
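For concreteness, a minimal sketch of that idea, using the toy data defined below (qr.fit, beta and X.clean are placeholder names of my own):

# pivoting QR, essentially the routine lm() uses internally;
# coefficients of aliased (collinear) columns come back as NA
qr.fit <- base::qr(X)
beta   <- base::qr.coef(qr.fit, y)
# keep only the columns with a non-NA estimate
X.clean <- X[, !is.na(beta), drop = FALSE]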
Consider the toy data set
# random design matrix
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
X <- cbind(1, x1, x2, x3)
# random outcome variable
y <- rnorm(100)
where variable x3 is a linear combination of x1 and x2, and the goal is therefore to drop x3 from X. We can quickly estimate the OLS coefficients using base as follows:
# base implementation
qr.base <- base::qr(X)
coef.base <- base::qr.coef(qr.base, y)
coef.base
                     x1          x2          x3 
-0.04330410 -0.04889519  0.07719935          NA 
Here, the coefficient of x3 is automatically set to NA, as desired. Similar results can be obtained by using the dense matrix X as input for Matrix::qr().
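A quick sketch of that dense check (my assumption being that Matrix::qr() simply falls back to the base method for an ordinary matrix):

# dense input: dispatches to the default (base) pivoting QR,
# so x3 is again returned as NA
qr.dense   <- Matrix::qr(X)
coef.dense <- qr.coef(qr.dense, y)
coef.dense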
However, things change when working with a sparse matrix class:
# Matrix implementation (sparse)
X.sparse <- Matrix::sparse.model.matrix(~ x1 + x2 + x3)
qr.matrix <- Matrix::qr(X.sparse)
coef.matrix <- Matrix::qr.coef(qr.matrix, y)
coef.matrix
  (Intercept)            x1            x2            x3 
-3.125000e-02 -2.088811e+14 -2.088811e+14  2.088811e+14 
where something clearly goes wrong at some point, due to X not being of full rank. Is there a way to generate results similar to what base::qr() gives, using Matrix::qr()?
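For what it is worth, the rank deficiency is visible in the sparse factorization itself. A small diagnostic sketch (assuming Matrix's exported rankMatrix() and qrR() helpers; the near-zero pivot is presumably what blows up the coefficients):

# numerical rank: four columns, but rank three
Matrix::rankMatrix(X.sparse)

# diagonal of the sparse R factor: the collinear column shows up as a
# (near-)zero pivot, which produces the huge coefficients above
R <- Matrix::qrR(qr.matrix, backPermute = FALSE)
diag(R)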