
I'm currently looping through a large data set, and what I discovered is that the higher the loop index, the slower the loop becomes. It goes pretty fast at the beginning but is incredibly slow at the end. What's the reason for this? Is there any way to bypass it?

Remarks: 1) I can't use plyr because the calculation is recursive. 2) The length of the output vector is not known in advance.

My code looks roughly like this:

  for (i in 1:20000) {

     if (i == 1) {

        temp <- "some function"(input_data[i])
        out  <- temp

     } else {

       temp <- "some function"(input_data[i], temp)
       out  <- rbind(out, temp)
     }
  }
Gavin Simpson
Steef Gregor
  • what is the function doing? – marbel Feb 23 '14 at 22:19
  • 4
    Preallocate instead of using `rbind`. If you don't know the length `out` will be, then over-allocate it and trim it when the loop terminates. Or, if `"some function"` returns a vector, make `out` a list and `unlist` it. – Joshua Ulrich Feb 23 '14 at 22:21
  • 1
    @JoshuaUlrich nailed it. See http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941 for more suggestions, but that's almost certainly the culprit here given the progressive slow-down (as the data being rbound gets bigger and memory gets more fragmented and you start pegging the disk). – Ari B. Friedman Feb 23 '14 at 22:23
  • Better yet: use the [appropriate higher-order list function](http://stat.ethz.ch/R-manual/R-devel/library/base/html/funprog.html) (`ply` is simply the wrong function here). The right function is conventionally known as a “prefix sum” or “scan”, but good luck finding that for R. R unfortunately only knows it as an accumulative `Reduce`. – Konrad Rudolph Feb 23 '14 at 22:58
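Konrad Rudolph's `Reduce` suggestion can be sketched like this. Here `foo()` is a hypothetical stand-in for the question's recursive step, chosen so each result depends on the previous one; with `accumulate = TRUE`, `Reduce` keeps every intermediate result, which is exactly a prefix scan:

```r
# foo() is a placeholder for the recursive step:
# it combines the current element with the previous result.
foo <- function(x, prev) x + prev

input <- 1:10

# accumulate = TRUE returns all intermediate results (a "scan"),
# not just the final reduction.
out <- Reduce(function(prev, x) foo(x, prev), input, accumulate = TRUE)

out  # here this is just the cumulative sum of 1:10
```

Because `Reduce` manages the accumulation internally, there is no repeated `rbind` and hence no repeated copying of a growing object.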

1 Answer


The problem is that you are growing the object `out` at each iteration, which entails larger and larger amounts of copying as the size of `out` increases (i.e. as your loop index increases).

In this case, you know the loop needs a vector of 20000 elements, so create one initially and fill in that object as you loop. Doing this also removes the need for the `if() ... else()`, which is likewise slowing down your loop and becomes appreciable as the number of iterations grows.

For example, you could do:

out <- numeric(20000)       # preallocate the full result vector once
out[1] <- foo(data[1])      # first element has no previous value
for (i in 2:length(out)) {
  out[i] <- foo(data[i], out[i-1])  # each step uses the previous result
}

What out needs to be when you create it will depend on what foo() returns. Adjust creation of out accordingly.
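If the final length really is unknown in advance (as the question's second remark says), Joshua Ulrich's over-allocate-and-trim idea can be sketched as follows. The `foo()` and the early-exit condition are hypothetical placeholders for whatever the real recursion and stopping rule are:

```r
# Placeholder for the recursive step from the question:
# combines the current element with the previous result.
foo <- function(x, prev = 0) x + prev

input <- 1:100

# Over-allocate a list to the maximum possible size, fill it,
# then trim and flatten once the (unknown) stopping point is reached.
out <- vector("list", length(input))
out[[1]] <- foo(input[1])
for (i in 2:length(input)) {
  out[[i]] <- foo(input[i], out[[i - 1]])
  if (out[[i]] > 40) break   # hypothetical early-exit condition
}
out <- unlist(out[seq_len(i)])   # drop unused slots, flatten to a vector
```

Filling a pre-sized list and calling `unlist` once at the end keeps each iteration O(1), whereas `rbind` inside the loop copies the whole accumulated object every time.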

Gavin Simpson