
The overall goal of the exercise is to find perfectly collinear columns in a very large but sparse matrix X in R, in the context of a linear regression problem. One approach that pops up from time to time is to use the underlying QR decomposition of lm() to extract the OLS coefficients and drop all variables whose estimate is NA. While base::qr() is much too slow to be a viable option, Matrix::qr() handles the sparse matrix well and is quite fast. Unfortunately, the results of the two implementations are quite different.

Consider the toy data set

# random design matrix
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
X  <- cbind(1, x1, x2, x3)

# random outcome variable
y  <- rnorm(100)

where x3 is a linear combination of x1 and x2, so the goal is to drop x3 from X. We can quickly estimate the OLS coefficients using base R as follows:

# base implementation
qr.base   <- base::qr(X)
coef.base <- base::qr.coef(qr.base, y)
coef.base
                     x1          x2          x3 
-0.04330410 -0.04889519  0.07719935          NA

Here, the coefficient of x3 is automatically set to NA, as desired. Similar results can be obtained by using the dense matrix X as input for Matrix::qr(). However, things change when working with a sparse matrix class:

# Matrix implementation (sparse)
X.sparse    <- Matrix::sparse.model.matrix(~ x1 + x2 + x3)
qr.matrix   <- Matrix::qr(X.sparse)
coef.matrix <- Matrix::qr.coef(qr.matrix, y)
coef.matrix
  (Intercept)            x1            x2            x3 
-3.125000e-02 -2.088811e+14 -2.088811e+14  2.088811e+14 

Something clearly goes wrong here because X is not of full rank. Is there a way to obtain results similar to those of base::qr() using Matrix::qr()?

2 Answers

4

It looks like Matrix::qr.coef(qr.matrix, y) gives a different answer because it never drops columns or pivots to deal with collinearity, so you would need to handle the collinear columns before computing the QR decomposition of X.sparse. One way to determine which columns need to be dropped uses the reduced row echelon form of t(X) %*% X and works as follows.

Consider a slight variation on your toy data set:

# random design matrix
set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
x4 <- rnorm(100)
x5 <- x4
x6 <- rnorm(100)
X  <- cbind(intercept=1, x1, x2, x3, x4, x5, x6)

# random outcome variable
y  <- rnorm(100)

I've added two "flavors" of collinearity because it's helpful to see the distinction in what follows, and an additional independent predictor x6 to show how the pattern continues with more predictors.

We can extract information about collinearity from the matrix t(X) %*% X, because it is precisely the inability to invert this singular matrix that causes the key problems with collinearity in the first place. In ordinary least squares, this is also what prevents us from directly calculating

beta <- solve(t(X) %*% X) %*% t(X) %*% y
Error in solve.default(t(X) %*% X) : 
  Lapack routine dgesv: system is exactly singular: U[6,6] = 0
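
As a quick sanity check (not part of the original answer, just a convenient confirmation), Matrix::rankMatrix() shows that X has fewer linearly independent columns than columns overall:

# rank of the design matrix vs. number of columns
Matrix::rankMatrix(X)[1]  # 5
ncol(X)                   # 7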

The process works as follows

# compute t(X) %*% X
tX_X <- t(X) %*% X

# compute the reduced row echelon form of tX_X
rref_tX_X <- pracma::rref(tX_X)
rref_tX_X
          intercept x1 x2 x3 x4 x5 x6
intercept         1  0  0  0  0  0  0
x1                0  1  0  1  0  0  0
x2                0  0  1  1  0  0  0
x3                0  0  0  0  1  1  0
x4                0  0  0  0  0  0  1
x5                0  0  0  0  0  0  0
x6                0  0  0  0  0  0  0

# relate predictors to each other to "visualize" non-independence
collinear_sets_matrix <- t(rref_tX_X) %*% rref_tX_X
collinear_sets_matrix
          intercept x1 x2 x3 x4 x5 x6
intercept         1  0  0  0  0  0  0
x1                0  1  0  1  0  0  0
x2                0  0  1  1  0  0  0
x3                0  1  1  2  0  0  0
x4                0  0  0  0  1  1  0
x5                0  0  0  0  1  1  0
x6                0  0  0  0  0  0  1

In collinear_sets_matrix, any row/column i with a 1 in position i and 0s elsewhere does not belong to a collinear set. In the example above, this only includes the intercept and predictor x6.
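
A small sketch of how one could pick those columns out programmatically; the tolerance 1e-8 is my own addition to guard against floating point noise:

# columns with a 1 on the diagonal and 0s elsewhere are not in any collinear set
tol <- 1e-8
independent <- abs(diag(collinear_sets_matrix) - 1) < tol &
  rowSums(abs(collinear_sets_matrix) > tol) == 1
colnames(collinear_sets_matrix)[independent]
# [1] "intercept" "x6"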

Given rref_tX_X, one can also work through its columns from left to right to determine which predictors to keep and which to drop. Keep a counter k, starting at 1. Whenever a column of rref_tX_X has a 1 in position k and 0s elsewhere, keep the corresponding predictor and increment k; otherwise that predictor is dropped. Continue until all columns have been checked.
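
A minimal sketch of that left-to-right scan; the tolerance and the name keep are my own additions:

# scan the columns of rref_tX_X from left to right, flagging the pivot columns
tol  <- 1e-8
keep <- logical(ncol(rref_tX_X))
k    <- 1
for (j in seq_len(ncol(rref_tX_X))) {
  col_j <- rref_tX_X[, j]
  is_pivot <- k <= nrow(rref_tX_X) &&
    abs(col_j[k] - 1) < tol &&
    all(abs(col_j[-k]) < tol)
  if (is_pivot) {
    keep[j] <- TRUE
    k <- k + 1
  }
}
colnames(X)[keep]
# [1] "intercept" "x1" "x2" "x4" "x6"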

In the example above, we would keep intercept, x1, x2, x4, and x6, which matches the coefficients that are not NA after the base::qr() and base::qr.coef() process:

qr.base   <- base::qr(X)
coef.base <- base::qr.coef(qr.base, y)
coef.base
   intercept           x1           x2           x3           x4           x5 
-0.040390807 -0.067473889  0.001486699           NA  0.046349008           NA 
      x6 
-0.098773117
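
Dropping the flagged columns before constructing the sparse matrix then lets Matrix::qr() behave as expected. A sketch using the keep vector from the scan above:

# drop the collinear columns, then run the sparse QR on the reduced matrix
X.reduced  <- Matrix::Matrix(X[, keep, drop = FALSE], sparse = TRUE)
qr.reduced <- Matrix::qr(X.reduced)
Matrix::qr.coef(qr.reduced, y)
# coefficients now match the non-NA values from base::qr.coef() above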
1

One somewhat hacky workaround I found uses the sparse solver from the top answer in this post. Using the example from the original question, we can set up everything as usual:

# random design matrix
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
X  <- cbind(1, x1, x2, x3)

# random outcome variable
y  <- rnorm(100)

# sparse explanatory matrix
X.sparse    <- Matrix::sparse.model.matrix(~ x1 + x2 + x3)

For the solver to work, we need to precompute t(X) %*% y and convert it to a numeric vector, since the solver expects a double vector as its second argument (which could probably be changed easily).

# compute X'y part of OLS solution
Xy = as.numeric(Matrix::t(X.sparse) %*% y)

If we then run the solver, we see that

# solve system
sparseLm_eigen(Matrix::t(X.sparse) %*% X.sparse, Xy)

will output

$status
[1] TRUE

$betahat
[1] -0.19579441  0.03075005  0.12532105  0.00000000

and assigns a hard zero to each perfectly collinear covariate. I have used this to filter out perfectly collinear variables before, but one needs to be a bit careful: although unlikely, you run the risk of accidentally filtering out a variable whose coefficient is genuinely (numerically) zero.
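
A sketch of how that filtering step could look, assuming sparseLm_eigen from the linked post is available; the object names fit and X.filtered are my own:

# keep only covariates whose solution is not a hard zero
fit        <- sparseLm_eigen(Matrix::t(X.sparse) %*% X.sparse, Xy)
keep       <- fit$betahat != 0          # caveat: also drops coefficients that are exactly zero
X.filtered <- X.sparse[, keep, drop = FALSE]
colnames(X.filtered)
# [1] "(Intercept)" "x1" "x2"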
