Why does `fastLm()` return results when I run a regression with one observation?

Question

Why does fastLm() return results when I run regressions with one observation?

In the following, why aren't the lm() and fastLm() results equal?

library(Rcpp)
library(RcppArmadillo)
library(data.table)
set.seed(1)
DT <- data.table(y = rnorm(5), x1 = rnorm(5), x2 = rnorm(5), my.key = 1:5)
#             y         x1         x2 my.key
# 1: -0.6264538 -0.8204684  1.5117812      1
# 2:  0.1836433  0.4874291  0.3898432      2
# 3: -0.8356286  0.7383247 -0.6212406      3
# 4:  1.5952808  0.5757814 -2.2146999      4
# 5:  0.3295078 -0.3053884  1.1249309      5

lm(y ~ 1 + x1 + x2, data = DT[my.key == 1])
# Coefficients:
# (Intercept)           x1           x2  
#     -0.6265           NA           NA

fastLm(X = model.matrix(y ~ 1 + x1 + x2, data = DT[my.key == 1]), y = DT[my.key == 1]$y)
# Coefficients:
# (Intercept)          x1          x2 
#    -0.15825     0.12984    -0.23924 

model.matrix(y ~ 1 + x1 + x2, data = DT[my.key == 1])
#   (Intercept)        x1       x2
#             1 -0.8204684 1.511781
# attr(,"assign")
# [1] 0 1 2

DT[my.key == 1]$y
# [1] -0.6264538

When I use the whole DT the results are equal:

 all.equal(fastLm(X = model.matrix(y ~ 1 + x1 + x2, data = DT), y = DT$y)$coef, 
           lm.fit(x = model.matrix(y ~ 1 + x1 + x2, data = DT), y = DT$y)$coef)
# [1] TRUE

From the RcppArmadillo library with a modified fLmTwoCasts I also get the same behavior:

src <- '
Rcpp::List fLmTwoCastsOnlyCoefficients(Rcpp::NumericMatrix Xr, Rcpp::NumericVector yr) {
    int n = Xr.nrow(), k = Xr.ncol();
    arma::mat X(Xr.begin(), n, k, false);
    arma::colvec y(yr.begin(), yr.size(), false);
    arma::colvec coef = arma::solve(X, y);
    return Rcpp::List::create(Rcpp::Named("coefficients")=trans(coef));
}
'
cppFunction(code=src, depends="RcppArmadillo")

XX <- model.matrix(y ~ 1 + x1 + x2, data = DT[my.key == 1])
YY <- DT[my.key == 1]$y
fLmTwoCastsOnlyCoefficients(XX, YY)$coef
#            [,1]      [,2]       [,3]
# [1,] -0.1582493 0.1298386 -0.2392384

Using the whole DT the coefficients are identical as they should:

lm(y ~ 1 + x1 + x2, data = DT)$coef
# (Intercept)          x1          x2 
#   0.2587933  -0.7709158  -0.6648270

XX.whole <- model.matrix(y ~ 1 + x1 + x2, data = DT)
YY.whole <- DT$y
fLmTwoCastsOnlyCoefficients(XX.whole, YY.whole)
#           [,1]       [,2]      [,3]
# [1,] 0.2587933 -0.7709158 -0.664827

"However, Armadillo will either fail or, worse, produce completely incorrect answers on rank-deficient model matrices whereas the functions from the stats package will handle them properly due to the modified Linpack code." — Roland, Jun 06 '16 at 12:24

Ben Bolker · Answer 1 · 2016-06-06T20:05:36.487

Because fastLm doesn't worry about rank-deficiency; this is part of the price you pay for speed.

From ?fastLm:

... The reason that Armadillo can do something like lm.fit faster than the functions in the stats package is because Armadillo uses the Lapack version of the QR decomposition while the stats package uses a modified Linpack version. Hence Armadillo uses level-3 BLAS code whereas the stats package uses level-1 BLAS. However, Armadillo will either fail or, worse, produce completely incorrect answers on rank-deficient model matrices whereas the functions from the stats package will handle them properly due to the modified Linpack code.

Looking at the code here, the guts of the code are

 arma::colvec coef = arma::solve(X, y);

This does a QR decomposition. We can match the lmFast results with qr() from base R (here I am not using only base R constructs rather than relying on data.table):

set.seed(1)
dd <- data.frame(y = rnorm(5), 
      x1 = rnorm(5), x2 = rnorm(5), my.key = 1:5)

X <- model.matrix(~1+x1+x2, data=subset(dd,my.key==1))
qr(X,dd$y)
## $qr
##   (Intercept)         x1       x2
## 1           1 -0.8204684 1.511781

You can look at the code of lm.fit() to see what R does about rank deficiency when fitting linear models; the underlying BLAS algorithm it calls does QR with pivoting ...

If you want to flag these situations, I think that Matrix::rankMatrix() will do the trick:

library(Matrix)
rankMatrix(X) < ncol(X)  ## TRUE
X1 <- model.matrix(~1+x1+x2, data=dd)
rankMatrix(X1) < ncol(X1)  ## FALSE

Hopefully, this is the only reason, right? So if I check the definiteness of X'X will prevent me for considering such cases. That is, `library(matrixcalc); is.positive.definite(crossprod(XX))` is `FALSE`, while `is.positive.definite(crossprod(XX))` is `TRUE`, in which case I do not consider the first case (XX has only one observation, while XX.whole 5). — Konstantinos, Jun 06 '16 at 12:34
I assume not, but I don't know. Here's the documentation for `arma::solve` ... http://arma.sourceforge.net/docs.html#solve — Ben Bolker, Jun 06 '16 at 12:43
FWIW the RcppEigen package has a fuller set of decompositions in its `fastLm()` example. But here it is as Ben so eloquently stated: _"this is part of the price you pay for speed."_. We give you enough rope to hang yourself. If you want to protect yourself from yourself, stick with `lm()` or `lm.fit()`, or create a hybrid 'fast-but-safe' version. — Dirk Eddelbuettel, Jun 06 '16 at 16:49

Why does `fastLm()` return results when I run a regression with one observation?

1 Answers1

Linked