
I'm using the R built-in lm() function in a loop for estimating a custom statistic:

# n, data$Y and Max are assumed to be defined beforehand
for (i in 1:10000) {
  x <- rnorm(n)                        # simulate a response vector of length n
  reg2 <- lm(x ~ data$Y)               # fit a simple linear regression
  Max[i] <- max(abs(rstudent(reg2)))   # keep the largest absolute studentized residual
}

This becomes really slow as both the loop counter (we typically want 10^6 to 10^9 iterations for precision reasons) and the size of Y increase.

Having read a related Stack Overflow topic, a first attempt was to optimize the whole procedure using parallel regression (with calm()):

  library(parallel)
  library(partools)    # assumed source of distribsplit() and calm()

  cls <- makeCluster(4)
  distribsplit(cls, "test")    # distribute the data frame `test` (holding Y) across the workers
  distribsplit(cls, "x")       # distribute the current x across the workers
  for (i in 1:10000) {
    x <- rnorm(n)
    reg2 <- calm(cls, "x ~ test$Y, data = test")
    Max[i] <- max(abs(reg2$residuals / sd(reg2$residuals)))
  }

This ended up being much slower (by a factor of 6) than the original, unparallelized loop. My assumption is that creating/destroying the workers at each loop iteration slows the process down a lot in R.

A second attempt was to use .lm.fit(), following another Stack Overflow topic:

  for (i in 1:10000) {
    x <- rnorm(n)
    reg2 <- .lm.fit(as.matrix(x), data$Y)                      # bare-bones fitter: no formula parsing
    Max[i] <- max(abs(reg2$residuals / sd(reg2$residuals)))    # residuals scaled by their SD
  }

This resulted in much faster processing than the original version, so that we now have .lm.fit() < lm() < calm() in terms of overall processing time.
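For reference, here is a minimal sketch of how the lm() and .lm.fit() variants can be timed against each other (assuming n and data$Y are defined as above and the microbenchmark package is installed; only the relative ordering matters):

library(microbenchmark)

microbenchmark(
  lm = {
    x <- rnorm(n)
    max(abs(rstudent(lm(x ~ data$Y))))
  },
  lm_fit = {
    x <- rnorm(n)
    r <- .lm.fit(as.matrix(x), data$Y)$residuals
    max(abs(r) / sd(r))
  },
  times = 100
)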

However, we are still looking for options to improve the efficiency (in terms of processing time) of this code. What are the possible options? I assume that parallelizing the loop itself would save some processing time?
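For instance, one option we have in mind is coarse-grained parallelization, where the cluster is created once and each worker runs whole iterations. A rough, untested sketch (assuming n and data are defined as above, and reusing the .lm.fit() variant):

library(parallel)

cl <- makeCluster(4)
clusterSetRNGStream(cl, 123)          # reproducible random streams on the workers
clusterExport(cl, c("n", "data"))     # send the inputs to the workers once

Max <- parSapply(cl, 1:10000, function(i) {
  x <- rnorm(n)
  r <- .lm.fit(as.matrix(x), data$Y)$residuals
  max(abs(r) / sd(r))
})

stopCluster(cl)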

Edit: Minimal Example

Here is a minimal example:

#Import data
sample <- read.csv("sample.txt")     # note: `sample` masks base::sample() here

#Preallocation
Max <- vector(mode = "numeric", length = 100)

n <- length(sample$AGE)
x <- matrix(rnorm(100 * n), 100)     # one simulated response per row

for (i in 1:100) {
  reg <- lm(x[i, ] ~ sample$AGE)     # regress the i-th simulated response on AGE
  Max[i] <- max(abs(rstudent(reg)))
}

with the following dataset 'sample.txt':

AGE
51
22
46
52
54
43
61
20
66
27

From here, we ran several tests and noted the following:

  • Following @Karo's contribution, we now generate the matrix of normal samples outside the loop to save execution time. We expected a noticeable improvement, but our test runs show the opposite (i.e. a longer execution time). Maybe the effect reverses as the number of simulations increases.
  • Following @BenBolker's suggestion, we also tested fastLm(); it reduces the execution time, but the results seem to differ (by a factor of roughly 0.05) from those of the standard lm().

We are still struggling to effectively reduce the execution time. Following @Karo's suggestions, we will try passing a vector directly to lm() and will investigate parallelization (our attempt with calm() failed for an unknown reason).

Grasshoper
  • What is the value of `n`? – Rui Barradas Jul 29 '20 at 17:19
  • I can't tell from your code, but it looks like you could grab a couple equations out of a stats book and calculate your `Max[i]` analytically. – Carl Witthoft Jul 29 '20 at 17:40
  • @RuiBarradas the size of _n_ can typically reach 10^3 and even 10^4 values for some datasets. – Grasshoper Jul 29 '20 at 17:56
  • @CarlWitthoft the statistic depends on the design of the model, defined by the user. – Grasshoper Jul 29 '20 at 17:56
  • First, why are you regressing x on y? Should we not regress y on x? How can a randomly generated value depend on data$Y? If your setup is indeed what you want, then consider `m<-rnorm(n*1000,n);lm(m~dat$Y)`; this will reduce the number of iterations – Onyambu Jul 29 '20 at 18:21
  • How about `fastLm`? https://stackoverflow.com/questions/30420185/fastlm-is-much-slower-than-lm – Ben Bolker Jul 29 '20 at 19:46
  • I'm with @RuiBarradas: can you tell us what `data$Y` / the value of `n` are? (Or what a typical value of `n` is / what value you tested with, since you say it varies) – Ben Bolker Jul 29 '20 at 19:52
  • @RuiBarradas I will make a minimal example to reflect the issue. – Grasshoper Jul 29 '20 at 19:54
  • @BenBolker In the benchmark provided, fastLm() is slower than lm.fit(), which is our best solution at the moment. – Grasshoper Jul 29 '20 at 19:55
  • How carefully did you read that answer? As I read Dirk Eddelbuettel's answer, it was slower *if a formula needed to be deparsed* – Ben Bolker Jul 29 '20 at 20:00
  • @Onyambu We want to estimate the law of the maximum of the studentized residuals, and we use simulations for this. Using `rnorm(n*1000, n)` generates 1000*n Gaussians with mean n and variance 1, which is not appropriate for our test. – Grasshoper Jul 30 '20 at 16:34
  • @BenBolker We changed to `fastLm()` but the results differ from the standard `lm()`; I will update the post with a minimal example now. – Grasshoper Jul 30 '20 at 16:34
  • @Grasshoper Sorry, I meant to write `m<-matrix(rnorm(n*1000),n);lm(m~dat$Y)`; this will reduce the computation time – Onyambu Jul 30 '20 at 16:36

2 Answers


Since I still can't comment:

Try to avoid loops in R. For some reason you are recalculating those random numbers every iteration. You can do that without a loop:

duration_loop <- system.time({
  for (i in 1:10000000) {
    x <- rnorm(10)                              # 10 random values per iteration
  }
})

duration <- system.time({
  m <- matrix(rnorm(10000000*10), 10000000)     # same draws, one matrix row per "iteration"
})

Both ways create 10 random values per iteration/matrix row, with the same number of iterations/rows. Though both ways seem to scale linearly, you should see a difference in execution time: the loop will probably be CPU-bound and the "vectorized" way memory-bound.

With that in mind, you probably should, and most likely can, avoid the loop altogether; for instance, you can pass a vector into the lm function. If you still need to be faster after that, you can certainly parallelise in a number of ways; it would be easier to suggest how with a working example of the data.
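As a minimal sketch of that idea (hypothetical names, with z standing in for your predictor): lm() accepts a matrix on the left-hand side of the formula and then fits one regression per column in a single call.

n <- 100
z <- rnorm(n)                        # stands in for your data$Y
m <- matrix(rnorm(n * 1000), n)      # 1000 simulated responses, one per column
fit <- lm(m ~ z)                     # a single call fits one regression per column
res <- residuals(fit)                # n x 1000 matrix of residuals

The per-regression studentized residuals would still have to be computed column by column, but the expensive part (the decomposition of the design matrix) is done only once.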

karo

Wide-ranging comments above, but I'll try to answer a few narrower points.

  • I seem to get the same (i.e., all.equal() is TRUE) results with .lm.fit and fastLmPure, if I'm careful about random-number seeds:
library(Rcpp)
library(RcppEigen)
library(microbenchmark)

nsim <- 1e3
n <- 1e5
set.seed(101)
dd <- data.frame(Y=rnorm(n))

testfun <- function(fitFn=.lm.fit, seed=NULL) {
    if (!is.null(seed)) set.seed(seed)
    x <- rnorm(n)
    reg2 <- fitFn(as.matrix(x), dd$Y)$residuals
    return(max(abs(reg2) / sd(reg2)))
}

## make sure NOT to use seed=101 - also used to pick y - 
## if we have y==x then results are unstable (resids approx. 0)
all.equal(testfun(seed=102), testfun(fastLmPure,seed=102)) ## TRUE

fastLmPure is fastest (but not by very much):

(bm1 <- microbenchmark(testfun(),
                     testfun(lm.fit),
                     testfun(fastLmPure),
                     times=1000))

Unit: milliseconds
                expr      min       lq      mean   median        uq      max
           testfun() 6.603822 8.234967  8.782436 8.332270  8.745622 82.54284
     testfun(lm.fit) 7.666047 9.334848 10.201158 9.503538 10.742987 99.15058
 testfun(fastLmPure) 5.964700 7.358141  7.818624 7.471030  7.782182 86.47498

If you wanted to fit many independent responses, rather than many independent predictors (i.e. if you were varying Y rather than X in the regression), you could provide a matrix for Y in .lm.fit, rather than looping over lots of regressions, which might be a big win. If all you care about are "residuals of random regressions" that might be worth a try. (Unfortunately, providing a matrix that combines many separate X vectors runs a multiple regression, not many univariate regressions ...)
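A rough sketch of that idea in the original lm(x ~ data$Y) orientation (simulated vectors as the responses, dd$Y as the single predictor); this assumes that .lm.fit, like lm.fit, accepts a matrix of responses and returns a matching matrix of residuals:

nsim2 <- 100
set.seed(103)
Ymat <- matrix(rnorm(n * nsim2), n)     # one simulated response per column
X <- cbind(1, dd$Y)                     # shared design matrix (intercept + predictor)
res <- .lm.fit(X, Ymat)$residuals       # n x nsim2 residual matrix from a single decomposition
Max2 <- apply(res, 2, function(r) max(abs(r) / sd(r)))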

Parallelizing is worthwhile, but will only scale (at best) according to the number of cores you have available. Doing a single run rather than a set of benchmarks because I'm lazy ...

Running 5000 replicates sequentially takes about 40 seconds for me (modern Linux laptop).

system.time(replicate(5000,testfun(fastLmPure), simplify=FALSE))
##    user  system elapsed 
##  38.953   0.072  39.028 

Running in parallel on 5 cores takes about 13 seconds, so a 3-fold speedup for 5 cores. This will probably be a bit better if the individual jobs are larger, but obviously will never scale better than the number of cores ... (8 cores didn't do much better).

library(parallel)
system.time(mclapply(1:5000, function(x) testfun(fastLmPure),
                     mc.cores=5))
##    user  system elapsed 
##  43.225   0.627  12.970 

It makes sense to me that parallelizing at a higher/coarser level (across runs rather than within lm fits) will perform better.

I wonder if there are analytical results you could use in terms of the order statistics of a t distribution ... ?

Ben Bolker