
A while ago I read two posts discussing the apply family of functions in R: whether they are "really" vectorized, and whether they improve execution time and/or memory usage. Both posts center on the idea that, under the hood, the apply functions are loops, the difference being that a for loop iterates in R whereas the apply functions loop in C. This makes it particularly interesting to vapply() primitive functions like sum(), which "make no use of R code", meaning

vapply(x, sum, numeric(1))

should show a particular advantage, since it is executed entirely in C. Sure enough, there is a pretty big speed-up when calling sum() directly in an apply versus calling it through an R wrapper. (Note: all expr columns are removed from the microbenchmark output for readability.)

> library(microbenchmark)
> set.seed(100)
> mydata <- as.data.frame(matrix(runif(5000000), ncol = 10000))
> microbenchmark(vapply(mydata, sum, numeric(1)))
Unit: milliseconds
       min       lq     mean   median       uq      max neval
  10.79918 11.16559 12.33968 11.45679 11.67717 34.89015   100
> microbenchmark(vapply(mydata, function(x) sum(x), numeric(1)))
Unit: milliseconds
      min       lq     mean   median       uq      max neval
 13.92961 14.20497 15.62257 14.33341 14.64748 32.71982   100
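
For comparison (a sketch, not one of the benchmarks above), here is the explicit R-level loop that vapply(mydata, sum, numeric(1)) replaces; every iteration here goes through the R interpreter, whereas vapply()'s loop over the columns runs in C (column_sums is just a throwaway name for illustration):

# Explicit R-level equivalent of vapply(mydata, sum, numeric(1));
# each iteration is interpreted in R rather than looped in C.
column_sums <- numeric(length(mydata))
for (i in seq_along(mydata)) {
  column_sums[i] <- sum(mydata[[i]])
}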

This got me wondering: when is it better to call an apply function multiple times with primitive functions, and when is it better to call it once with an R wrapper? For instance, to compute means:

> microbenchmark(vapply(mydata, mean, numeric(1)))
Unit: milliseconds
      min       lq     mean  median       uq      max neval
 88.90804 96.81667 112.2362 108.323 117.7397 585.6286   100
> microbenchmark(colMeans(mydata))
Unit: milliseconds
      min       lq     mean   median       uq      max neval
 53.13873 57.16753 64.98141 59.10488 73.67025 100.2248   100
> microbenchmark(vapply(mydata, sum, numeric(1))/vapply(mydata, length, numeric(1)))
Unit: milliseconds
     min       lq     mean   median       uq      max neval
 15.1812 15.74453 17.59638 16.08719 16.97016 34.12331   100
> microbenchmark(vapply(mydata, function(x) sum(x)/length(x), numeric(1)))
Unit: milliseconds
      min       lq     mean   median       uq      max neval
 16.09877 17.24127 20.73641 18.23274 20.59589 58.83919   100

I assume the longer run times for mean() and colMeans() have to do with argument checking rather than with mean() not being a primitive (correct me if I am wrong). (@Roland corrected me: for colMeans() it is because the data.frame must first be converted to a matrix.) More interesting was the speed-up from calling the primitives sum() and length() independently rather than as function(x) sum(x)/length(x). The difference isn't big, but it is consistently there across the microbenchmark distribution.
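
As a quick check on the dispatch explanation (a sketch under the same setup; I did not include it in the benchmarks above), you can confirm that sum() is a primitive while mean() is an S3 generic that routes through the R closure mean.default(), and compare the dispatched and non-dispatched calls directly:

# sum() is a primitive and goes straight to C; mean() is an S3 generic,
# so every call pays for method dispatch plus the argument handling in
# mean.default() before reaching the internal C code.
is.primitive(sum)   # TRUE
is.primitive(mean)  # FALSE: mean() calls UseMethod("mean")
x <- mydata[[1]]
microbenchmark(mean(x), mean.default(x), sum(x)/length(x))

This made me wonder about anonymous functions that mix primitive and non-primitive functions, so I ran: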

> microbenchmark(vapply(mydata, function(x) sum(x)/sd(x), numeric(1)))
Unit: milliseconds
      min       lq     mean  median       uq      max neval
 161.7666 207.5839 220.5956 213.863 231.5117 623.3246   100
> microbenchmark(vapply(mydata, sum, numeric(1))/vapply(mydata, sd, numeric(1)))
Unit: milliseconds
      min       lq     mean   median       uq      max neval
 164.1759 195.0166 212.6117 207.3877 219.0822 640.5814   100

Once again, not a huge difference, but the improvement shows up everywhere in the distribution except the min and max. Finally, I ran one last experiment calling two non-primitives:

> microbenchmark(vapply(mydata, function(x) mean(x)/sd(x), numeric(1)))
Unit: milliseconds
      min       lq    mean  median       uq      max neval
 290.6148 337.9683 358.855 356.099 368.3152 840.7395   100
> microbenchmark(vapply(mydata, mean, numeric(1))/vapply(mydata, sd, numeric(1)))
Unit: milliseconds
      min       lq     mean   median       uq      max neval
 246.9073 287.1489 303.7743 300.0357 319.8356 400.5652   100

This one really surprised me. Both mean() and sd() are R functions, so intuitively it should be better to loop once while calling both R functions than to loop twice calling one each, yet the two separate vapply() calls came out faster.
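
To isolate how much of this is just the extra anonymous call per column, one follow-up (a sketch in the spirit of the first benchmark; I have not run it here) would be to repeat that experiment with sd() instead of sum(), so the only difference between the two expressions is one extra R function call per column:

# If the anonymous closure itself is the overhead, the gap seen for
# sum() vs function(x) sum(x) should reappear here for sd().
microbenchmark(
  vapply(mydata, sd, numeric(1)),
  vapply(mydata, function(x) sd(x), numeric(1))
)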

So for my questions:

  1. Why am I getting faster run times from calling vapply() twice with simpler functions than from calling it once with a slightly more complex function?
  2. In each of these cases, where is the speed-up coming from: avoiding R-level functions by calling the apply function directly on primitives, or from splitting the work into multiple apply calls?
  3. Should I always split my function calls into multiple simpler apply calls rather than one complex call (even if it means 7 calls to an apply function), and what makes a function "simple" enough to be worth it?
  • I'm pretty sure David doesn't consider a vapply loop vectorized. vapply's better speed comes from knowing the structure of the result in advance. colMeans is slower because it calls as.matrix.data.frame at the start. – Roland Dec 13 '16 at 19:47
  • @Roland You are correct that in his answer David said he didn't feel `vapply()` is vectorized, but when it is called on a primitive, as far as I can tell it fits his definition. Am I missing something? – Barker Dec 13 '16 at 19:55
  • No, I don't consider `vapply` vectorized. I don't think you understood my answer at all. – David Arenburg Dec 13 '16 at 21:03
  • @DavidArenburg I removed the reference to your answer since you felt it was inappropriate and it was more of a motivation for the question than anything else. I did not mean to offend. Hopefully you will continue to engage with me on the answer I had previously linked because I understand why you don't feel `vapply(x, mean, numeric(1))` is vectorized but I don't understand why `vapply(x, sum, numeric(1))` isn't based on my understanding of how `R` calls its compiled code. – Barker Dec 14 '16 at 00:15
  • Look into overhead from S3 method dispatch to understand why `mean` is slower. Then, `.Primitive` functions are slightly more efficient than closures, but the difference is smaller than between generics and non-generics. You could also look into byte-compiling. Also, don't confuse `vapply` and `apply`. And finally, vectorized code is *at least* as fast as an R loop; usually much faster because you avoid overhead from repeated R function calls. – Roland Dec 14 '16 at 12:42
  • Thanks @Roland, I think you helped me understand pretty well why the `mean()` call is slower. Did you address why two `vapply()`s (both `C` loops) are faster than one `vapply()` that calls the same functions with an anonymous function? I didn't see anything in your answer, but I may just not have understood it. – Barker Dec 14 '16 at 18:05
  • length(mydata) calls to / vs one call to / could be part of the reason. Benchmark /. – Roland Dec 14 '16 at 18:09
  • @Roland good point, not quite sure the right way to do that though. I tried `var2 <- mydata[1, ]` and `system.time(vapply(var2 , "/", numeric(1), 1))` which gave me `user: 0.51, system: 0.95, elapsed: 0.51` which is more than the difference when calling the primitives and less than the difference for the other two, so I am not sure if that means I set up the experiment wrong or that we can reject that hypothesis. – Barker Dec 14 '16 at 18:29
  • Use a package like microbenchmark for proper benchmarking. – Roland Dec 14 '16 at 18:31
  • Thanks @Roland I changed the benchmarks in my question and reduced the sample data size to keep the benchmark run times reasonable. When I did `microbenchmark` on the old data set, I got similar times as in my previous comment. With the new data set it is `mean: 4.237, median: 4.188`. – Barker Dec 14 '16 at 23:02
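
Following up on Roland's suggestion to benchmark `/` itself, a cleaner version of the experiment from the comments might look like this (a sketch; v1 and v2 are names made up here, and no results are claimed):

# Compare one vectorized division against length(mydata) separate
# R-level calls to `/`, using the two vapply results from the question.
v1 <- vapply(mydata, sum, numeric(1))
v2 <- vapply(mydata, length, numeric(1))
microbenchmark(
  v1 / v2,              # a single vectorized call to `/`
  mapply(`/`, v1, v2)   # one R-level call to `/` per column
)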
