4

I've used lm() to fit multiple regression models, for multiple (~1 million) response variables in R. Eg.

allModels <- lm(t(responseVariablesMatrix) ~ modelMatrix)

This returns an object of class "mlm", which is like a huge object containing all the models. I want to get the t-statistic for the first coefficient in each model, which I can do using the summary(allModels) function, but its very slow on this large data and returns a lot of unwanted info too.

Is there a faster way of calculating the t-statistic manually, that might be faster than using the summary() function

Thanks!

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
user2141118
  • 103
  • 1
  • 1
  • 4

1 Answers1

1

You can hack the summary.lm() function to get just the bits you need and leave the rest.

If you have

nVariables <- 5
nObs <- 15

y <- rnorm(nObs)
x <- matrix(rnorm(nVariables*nObs),nrow=nObs)

allModels <-lm(y~x)

Then this is the code from the lm.summary() function but with all the excess baggage removed (note, all the error handling has been removed as well).

p <- allModels$rank
rdf <- allModels$df.residual
Qr <- allModels$qr
n <- NROW(Qr$qr)
p1 <- 1L:p
r <- allModels$residuals
f <- allModels$fitted.values
w <- allModels$weights
mss <- if (attr(allModels$terms, "intercept")) 
sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)
resvar <- rss/rdf
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
est <- allModels$coefficients[Qr$pivot[p1]]
tval <- est/se

tval is now a vector of the t statistics as also give by

summary(allModels)$coefficients[,3]

If you have problems on the large model you might want to rewrite the code so that it keeps fewer objects by compounding multiple lines/assignments into fewer lines.

Hacky solution I know. But it will be about as fast as possible. I suppose it would be neater to put all the lines of code into a function as well.

timcdlucas
  • 1,334
  • 8
  • 20