
For a neuroimaging application, I'm trying to fit many linear models by least squares in R (standard call to lm). Imagine I have a design matrix X. This design matrix will be the same across all of the models. The data (Y) that is being fit will change, and as a result so will all of the fit parameters (e.g. betas, p-values, residuals, etc).

At present, I'm just sticking it in a for loop, so it's doing hundreds of thousands of calls to lm. It seems like there's got to be a better way.

I believe the most computationally expensive piece is the matrix inversion. It looks like this gets handled with a Fortran call in lm.fit.

If I were doing this regression by hand, I'd do the matrix inversion, then just multiply it by the various datasets. In fact, I've coded up a function to do that when I have well-behaved design matrices (e.g. all continuously valued covariates). However, I really like all of the work that lm does, like recoding my factors appropriately, etc, and the output of lm is really nice, too.
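That by-hand approach can be sketched as follows (illustrative data; it assumes a full-rank, all-numeric design matrix, so no factor recoding is needed):

```r
# Sketch of the "solve once, reuse" approach: factor the normal equations a
# single time, then apply the result to every response column.
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 2), n))   # hypothetical design: intercept + 2 covariates
Y <- matrix(rnorm(n * 5), n)             # 5 response vectors as columns

proj  <- solve(crossprod(X), t(X))       # (X'X)^{-1} X', computed once
betas <- proj %*% Y                      # 3 x 5: one column of coefficients per response
```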

Is there any way to have my cake and eat it, too? Namely, to get the friendliness of lm, but harness that machinery to efficiently fit many models with identical design matrices?

Zheyuan Li
Daniel Kessler
  • @gung Yes, for this question I think you are right. I'll go ahead and flag. Thanks for your help! – Daniel Kessler Jan 28 '13 at 20:47
    Whichever algorithm you use, once the QR decomposition of the model matrix is computed, *it shouldn't be recomputed* for a new y. The only part of the calculation that should be repeated is the part that depends on y. So responses that avoid that recomputation are, *ceteris paribus*, much more efficient than those that don't. For any suggestion that doesn't avoid it, there will be an equivalent response that does, and it should be faster. If you want to do the regressions separately, you can reuse the already computed parts; there are several ways to do it. – Glen_b Jan 29 '13 at 03:12
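Glen_b's point can be illustrated with base R's qr() helpers (a sketch with made-up data; qr.coef() and qr.resid() reuse the stored factorisation rather than recomputing it):

```r
# Factor the design matrix once, then reuse that factorisation for each new y.
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 2), n))   # illustrative design matrix
Y <- matrix(rnorm(n * 4), n)             # 4 responses as columns

QR     <- qr(X)              # the expensive part, done once
betas  <- qr.coef(QR, Y)     # coefficients for every column of Y
resids <- qr.resid(QR, Y)    # residuals, reusing the same factorisation
```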

3 Answers


Yes, there is a better way. We have been writing example replacement functions fastLm() based on external C/C++ code from Armadillo, GSL and Eigen in the packages RcppArmadillo, RcppGSL and RcppEigen.

By far the largest amount of time is spent setting up the model matrix and deparsing the formula. You can read the source of lm(), or maybe ours in fastLm(), and see how to do this parsing just once. Keep the right-hand side, and then loop over your different y vectors. Which fitting function you use matters less. I like fastLm() from RcppArmadillo, but hey, I also wrote it :)
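In outline, the idea looks like this (a sketch, not the actual fastLm() internals; the data frame and formula here are made up). model.matrix() does the formula parsing and factor recoding once, and the low-level lm.fit() is then called per response:

```r
# Do the formula/model-matrix work a single time, then loop over the y vectors.
set.seed(1)
df <- data.frame(x = rnorm(30), g = factor(rep(letters[1:3], 10)))
X  <- model.matrix(~ x + g, data = df)   # factor recoding handled once
Ys <- matrix(rnorm(30 * 4), 30)          # columns = separate responses

betas <- apply(Ys, 2, function(y) lm.fit(X, y)$coefficients)
```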

Dirk Eddelbuettel
  • I recently came across fastLm and yes indeed it's fast! I'm trying to replace a call to lm.wfit() with fastLm(), calling it as: **fastLm(X * wts, y * wts, method = 0L)**. The coefficients matched those from lm.wfit(), but the fitted values didn't; do you have any idea what I could be missing? I cannot ask a question because of bad performance on SO, so asking it here. Really sorry! – joel.wilson Oct 14 '16 at 08:11
  • Thinking error on your part: use `sqrt(w)`, not `w`. – Dirk Eddelbuettel Oct 14 '16 at 11:17
  • I have included that; actually wts <- sqrt(w), sorry I didn't mention that. However, I found the solution: the fitted values from fastLm also need to be divided by wts: `fastLm$fitted.values <- fastLm$fitted.values/wts`; `fastLm$residuals <- fastLm$residuals/wts` – joel.wilson Oct 14 '16 at 12:33
  • I am not sure what point you are trying to make: not one of the three implementations of `fastLm()` claims to work with weights ... – Dirk Eddelbuettel Oct 14 '16 at 13:03

From the help page for lm:

If ‘response’ is a matrix a linear model is fitted separately by least-squares to each column of the matrix.

So it would seem that a simple approach would be to combine all the different y vectors into a matrix and pass that as the response in a single call to lm. For example:

(fit <- lm( cbind(Sepal.Width,Sepal.Length) ~ Petal.Width+Petal.Length+Species, data=iris))
summary(fit)
summary(fit)[2]
coef(summary(fit)[2])
coef(summary(fit))[2]
sapply( summary(fit), function(x) x$r.squared )
Greg Snow
  • This seems to do the calculation really efficiently, yay! However, for really large models (ncol(response) ~= 600k) the summary takes a very long time to calculate. I need to pull out t-values and p-values corresponding to the betas, and the only way I know to do that is with summary. Any ideas for a more efficient way to pull those out, or am I stuck rewriting summary to be more efficient? – Daniel Kessler Jan 29 '13 at 23:40
  • You could look at functions like `estVar` or `vcov`; they might be faster (then calculate the test statistics from that info). – Greg Snow Jan 30 '13 at 05:38
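One way to follow that suggestion without calling summary() at all is to compute the t- and p-values directly from a single QR factorisation (a sketch with made-up data; it assumes a full-rank, unpivoted design and homoscedastic errors):

```r
# t- and p-values for every model without summary(): one QR factorisation,
# then vectorised algebra across all response columns.
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n))                   # illustrative full-rank design
Y <- matrix(rnorm(n * 3), n)              # 3 responses as columns

QR     <- qr(X)
B      <- qr.coef(QR, Y)                  # p x m coefficients
E      <- qr.resid(QR, Y)                 # n x m residuals
df_res <- n - ncol(X)
sigma2 <- colSums(E^2) / df_res           # residual variance, one per model
xtx_inv_diag <- diag(chol2inv(qr.R(QR)))  # diag of (X'X)^{-1} via R'R = X'X
SE    <- sqrt(outer(xtx_inv_diag, sigma2))      # p x m standard errors
tvals <- B / SE
pvals <- 2 * pt(abs(tvals), df_res, lower.tail = FALSE)
```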

I do not know of a better way using lm, but you may want to consider the function lsfit. Although simpler and with fewer bells and whistles, its syntax lsfit(X, y) allows y to be not just a vector of response values but also a matrix. A single call to lsfit then fits all columns of y by regressing them on the same design matrix X. Quite fast and handy.
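For example (made-up data; note that lsfit() adds an intercept column by default):

```r
# One lsfit() call fits every column of Y against the same design matrix X.
set.seed(1)
X <- matrix(rnorm(20 * 3), 20)   # shared design: 3 covariates
Y <- matrix(rnorm(20 * 2), 20)   # two response vectors as columns

fit <- lsfit(X, Y)               # intercept added automatically
dim(fit$coefficients)            # 4 x 2: intercept + 3 slopes, per response
```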

F. Tusell