3

I'm struggling to vectorize the repeated application of the following function, which I have currently implemented as a for loop. This small example is representative of a problem with a much larger dataset, where vectorizing should yield a worthwhile runtime improvement:

action = function(x,y,i) {

    firsttask = cumsum(x[which(x<y[i])])
    secondtask = mean(firsttask)
    thirdtask = min(firsttask[which(firsttask>secondtask)])
    fourthtask = length(firsttask)

    output = list(firsttask, data.frame(average=secondtask,
                                        min_over_mean=thirdtask,
                                        size=fourthtask))
    return(output)
}

thingofinterest = c(1:10)
limits = c(5:10)

test = vector("list", length = length(limits))
for(i in 1:length(limits)) {
  test[[i]] = action(thingofinterest, limits, i)
}

test

I want to replace the for loop with a vectorized command, not any of the apply family of functions, as they do not always improve performance (I'm not suggesting there is anything wrong with for loops, I just need to optimize for speed in this case. See: Is R's apply family more than syntactic sugar?). How would I do this?

Nigel Stackhouse
  • For loops can usually be vectorized using `apply` or `do.call`. I think in this case, the latter would be most useful. – RHA Oct 09 '15 at 11:44
  • I'm trying to avoid `apply`, as many functions in that family are just glorified for loops. `do.call` only outputs `action(thingofinterest,5)`. – Nigel Stackhouse Oct 09 '15 at 11:49
  • Why are you trying to avoid loops? There's nothing inherently wrong with implicit or explicit loops. – Joshua Ulrich Oct 09 '15 at 11:55
  • Two things you could easily improve: initialize the `test` vector to the correct size upfront using `test = vector("list", length = length(limits))` to avoid growing the object in a loop (which makes it slow!), and in your `action` function you could compute `mean(firsttask)` before the second step and use the value stored in a variable – talat Oct 09 '15 at 11:56
  • How about the parallel apply (`parApply`) family? That might speed things up a bit. – phiver Oct 09 '15 at 11:57
  • You can see if lapply speeds it up `lapply(1:length(limits), action, x=thingofinterest, y=limits)` – Pierre L Oct 09 '15 at 12:01
  • @JoshuaUlrich: the for loop is fine here, but I'm applying this general idea to a much larger dataset that takes hours to get through the for loop. I'm trying to speed it up by vectorizing it. – Nigel Stackhouse Oct 09 '15 at 12:03
  • @docendodiscimus: Thanks for the tips, I've incorporated them now. – Nigel Stackhouse Oct 09 '15 at 12:04
  • @NigelStackhouse The *apply family are C-implemented loops, calling the corresponding C implementation of the base R function when applicable, so the statement "essentially masked for loops" is really wrong. There's nothing really wrong with for loops, nor with *apply calling custom complex functions, as long as you avoid growing objects within them and repeating the same call again and again. – Tensibai Oct 09 '15 at 12:18
  • @Tensibai: I've edited the question so as not to mislead. Again, I'm not saying there's anything wrong with for loops, I just don't want to use one for these purposes as it is not the fastest option and I need to optimize for speed. – Nigel Stackhouse Oct 09 '15 at 12:25
  • @NigelStackhouse, how large is your real data? – talat Oct 09 '15 at 12:26
  • @NigelStackhouse I'm working on it; I'm pretty sure your code can be improved (at first glance the `which` is not needed, as the logical vector from the `<` will give the same result). I'm moving your code into a function to be able to compare results and benchmark my idea against it (I'll post the benchmark of the alternatives along with the other answers, if any, so you'll have a fair picture of all of them on the same machine at the end :) ). A [`lineprof`](https://github.com/hadley/lineprof) on your code gives good clues on where the bottlenecks are. – Tensibai Oct 09 '15 at 12:30
  • @docendodiscimus, the `thingofinterest` analog is 13 million elements long; `limits` can vary but is usually between 90 and 110 elements long. – Nigel Stackhouse Oct 09 '15 at 12:32
  • And while we're at it, I recommend reading [is-the-apply-family-really-not-vectorized](http://stackoverflow.com/questions/28983292/is-the-apply-family-really-not-vectorized) (the main idea behind these comments being: if you're calling functions that only accept scalars, you'll have a bad time avoiding loops) – Tensibai Oct 09 '15 at 12:33
  • @NigelStackhouse I made the assumption that `thingofinterest` would be random values and not always increasing; is that a correct assumption? – Tensibai Oct 09 '15 at 13:43
  • @Tensibai, that is a correct assumption; the flexibility would be ideal, in fact! – Nigel Stackhouse Oct 09 '15 at 13:50
  • @NigelStackhouse In this case (if they can't be sorted) the main drawback is the cumsum, which gives different results depending on the selection made :/ I really see no way to improve speed short of slicing the limits and doing parallel processing – Tensibai Oct 09 '15 at 13:59

3 Answers

4

You need to understand where the bottlenecks are in your code before you start trying to change it to make it faster. For example:

timer <- function(action, thingofinterest, limits) {
  st <- system.time({           # for the wall time
    Rprof(interval=0.01)        # Start R's profile timing
    for(j in 1:1000) {          # 1000 function calls
      test = vector("list")
      for(i in 1:length(limits)) {
        test[[i]] = action(thingofinterest, limits, i)
      }
    }
    Rprof(NULL)  # stop the profiler
  })
  # return profiling results
  list(st, head(summaryRprof()$by.total))
}
action = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = min(firsttask[which(firsttask>mean(firsttask))])
  thirdtask = mean(firsttask)
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(average=secondtask,
                                      min_over_mean=thirdtask,
                                      size=fourthtask))
  return(output)
}
timer(action, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   9.720   0.012   9.737 
# 
# [[2]]
#                 total.time total.pct self.time self.pct
# "system.time"         9.72    100.00      0.07     0.72
# "timer"               9.72    100.00      0.00     0.00
# "action"              9.65     99.28      0.24     2.47
# "data.frame"          8.53     87.76      0.84     8.64
# "as.data.frame"       5.50     56.58      0.44     4.53
# "force"               4.40     45.27      0.11     1.13

You can see that very little time is spent outside the call to your `action` function. Now, `for` is a special primitive and is therefore not captured by the profiler, but the total time reported by the profiler is very similar to the wall time, so there can't be a lot of time missing from the profiler time.

And the thing that takes the most time in your action function is the call to data.frame. Remove that, and you get an enormous speedup.

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask>mean(firsttask))])
  fourthtask = length(firsttask)
  list(task=firsttask, average=secondtask,
    min_over_mean=thirdtask, size=fourthtask)
}
timer(action1, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   1.020   0.000   1.021 
# 
# [[2]]
#               total.time total.pct self.time self.pct
# "system.time"       1.01    100.00      0.06     5.94
# "timer"             1.01    100.00      0.00     0.00
# "action"            0.95     94.06      0.17    16.83
# "which"             0.57     56.44      0.23    22.77
# "mean"              0.25     24.75      0.13    12.87
# "<"                 0.20     19.80      0.20    19.80

Now you can also get rid of one of the calls to mean and both calls to which.

action2 = function(x,y,i) {
  firsttask = cumsum(x[x < y[i]])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[firsttask > secondtask])
  fourthtask = length(firsttask)
  list(task=firsttask, average=secondtask,
    min_over_mean=thirdtask, size=fourthtask)
}
timer(action2, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   0.808   0.000   0.808 
# 
# [[2]]
#               total.time total.pct self.time self.pct
# "system.time"       0.80    100.00      0.12    15.00
# "timer"             0.80    100.00      0.00     0.00
# "action"            0.68     85.00      0.24    30.00
# "<"                 0.20     25.00      0.20    25.00
# "mean"              0.13     16.25      0.08    10.00
# ">"                 0.05      6.25      0.05     6.25

Now you can see there's a "significant" amount of time spent doing stuff outside your action function. I put significant in quotes because it's 15% of the runtime, but only 120 milliseconds. If your actual code took ~12 hours to run, this new action function would finish in ~1 hour.

The results would be marginally better if I pre-allocated the test list outside of the for loop in the timer function, but the call to data.frame is the biggest time-consumer.
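For reference, here is a minimal sketch of that pre-allocation (the `timer2` name is hypothetical and the harness is simplified, so treat it as illustrative rather than part of the timings above):

timer2 <- function(action, thingofinterest, limits) {
  system.time({
    for (j in 1:1000) {
      # allocate the list once at its final length instead of growing it
      test <- vector("list", length(limits))
      for (i in seq_along(limits)) {
        test[[i]] <- action(thingofinterest, limits, i)
      }
    }
  })
}
timer2(action2, 1:1000, 50:100)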

Joshua Ulrich
  • So you're suggesting the same as I did except you removed `which`? (Btw, I read somewhere that subsetting by an integer vector is slightly faster than by a logical index?) – talat Oct 09 '15 at 13:18
  • @docendodiscimus: I'm suggesting a process, and by using that process I made similar suggestions. I removed the two calls to `which` because they appeared as 50% of the run time. Subsetting by integer is faster than subsetting by logical, because the logical vector must be converted to its integer index locations before the object can be subset in the C code. That said, the call to `which` is not free, so you must evaluate the tradeoff between slower logical index subsetting and conversion to integer outside of the `[` call (a rough benchmark sketch follows after these comments). – Joshua Ulrich Oct 09 '15 at 13:29
  • Add three zeros to the thingofinterest (the OP is talking about a 13M-element vector, after all) and the call to `which` is clearly noticeable. – Tensibai Oct 09 '15 at 13:35
  • This was a great mini lesson in the value of evaluating my code in depth. The ~12 times speed increase is great. – Nigel Stackhouse Oct 09 '15 at 14:01
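A rough sketch of the logical-vs-integer subsetting tradeoff discussed above (not from the original thread; the vector size and threshold are arbitrary):

library(microbenchmark)
x <- runif(1e6)
microbenchmark(
  logical = x[x < 0.5],         # subset by logical mask (converted to indices internally)
  integer = x[which(x < 0.5)],  # pay for which(), then subset by integer index
  times = 100
)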
3

Here's a little comparison in reference to my comment above. I've made the changes described in the comment (initialize `test` to its final length, change the order of the computations in `action`, and, if you can accept it, remove the `data.frame` call from the list output of `action`):

library(microbenchmark)
microbenchmark(f0(), f1())

Unit: microseconds
 expr       min        lq      mean     median        uq       max neval
 f0() 14042.192 14730.036 16091.508 15168.3175 16993.631 28193.770   100
 f1()   894.555   928.738  1094.448   985.2865  1190.252  4710.675   100

These modifications brought a roughly 15-fold speed-up.

Functions and data for comparison:

action0 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = min(firsttask[which(firsttask>mean(firsttask))])
  thirdtask = mean(firsttask)
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(min_over_mean=secondtask,
                                      average=thirdtask,
                                      size=fourthtask))
  return(output)
}

f0 <- function() {
  test = vector("list")
  for(i in 1:length(limits)) {
    test[[i]] = action0(thingofinterest, limits, i)
  }
}

thingofinterest = c(1:1000)
limits = c(50:100)

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
       average=thirdtask,
       size=fourthtask)
}

f1 <- function() {
  test = vector("list", length = length(limits))
  for(i in 1:length(limits)) {
    test[[i]] = action1(thingofinterest, limits, i)
  }
}
talat
  • @JoshuaUlrich, I know, that's because I removed the `data.frame` call as mentioned at the top. I was assuming this might be acceptable for faster computation – talat Oct 09 '15 at 12:22
  • Removing the `data.frame` call accounts for nearly all of that speedup. – Joshua Ulrich Oct 09 '15 at 12:22
  • I guess if you test with a larger data set, the initialization of `test` will gain importance. And in general, that might still be an interesting insight for the OP – talat Oct 09 '15 at 12:24
  • The prior initialization of `test` is interesting, although in my actual problem (not the example provided), I don't know the length of `limits` ahead of time (it depends on a number of other features). I'll see if I can work this into the reproducible example. Ultimately, I want to do away with the for loop entirely. Even with this speed up, applying it to my larger dataset still takes too long. – Nigel Stackhouse Oct 09 '15 at 12:29
2

Just to add a comparison point with the *apply family, I used the code below (results verified with `identical(f1(), f2())`; `f3` returns a different layout).

After testing, keeping the `which` call gives some speed increase on a large `thingofinterest` vector.

thingofinterest = c(1:100000)
limits = c(50:1000)

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
       average=thirdtask,
       size=fourthtask)
}

f1 <- function() {
  test = vector("list", length = length(limits))
  for(i in 1:length(limits)) {
    test[[i]] = action1(thingofinterest, limits, i)
  }
  return(test)
}


action2 <- function(x,y) {
  firsttask = cumsum(x[which(x<y)])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
                  average=thirdtask,
                  size=fourthtask)
}

f2 <- function() {
  test <- lapply(limits,action2,x=thingofinterest)
  return(test)
}

f3 <- function() {
  test <- sapply(limits,action2,x=thingofinterest)
  return(test)
}

For a 1M-element `thingofinterest` and 950 limits, here are the results on my machine:

> microbenchmark(f1(),f2(),f3(),times=10)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 f1() 4.303302 4.336767 4.373119 4.380383 4.403434 4.441945    10
 f2() 4.267922 4.327208 4.450175 4.399422 4.423191 5.041011    10
 f3() 4.240551 4.293855 4.412548 4.362949 4.468117 4.730717    10

So a cleanly done for loop is not that bad in this case.

I have the feeling there's probably a data.table way to do the "action" work in one pass, but it's outside my knowledge area for now.

More on the speed topic: I see no way to really vectorize this. Because the selected vectors are not prefixes of one another (the kept elements are interleaved), their cumsums can't be "cut" and reused to avoid recomputing the common parts.

As you say in the comments that `limits` usually has between 90 and 110 entries, parallel processing could be a sensible approach: each iteration is independent of the others, so each one can be computed on a different core. (I'm thinking of `mclapply`, but there may be others better adapted to your use case.)
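As a minimal sketch of what that could look like, reusing `action2`, `limits` and `thingofinterest` from above (`mc.cores = 4` is an arbitrary choice, and `mclapply` relies on forking, so it only runs in parallel on Unix-alikes):

library(parallel)
# each limit is independent, so the iterations can be spread across cores
test_par <- mclapply(limits, action2, x = thingofinterest, mc.cores = 4)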

Tensibai
  • I was hoping for something like a data.table approach, but clearly that would be added effort for little return. An efficient for loop works just fine for the problem at hand; thanks for the thorough explanation. – Nigel Stackhouse Oct 09 '15 at 13:58