A while ago I read two posts about the apply family of functions in R: whether they are "really" vectorized, and whether they improve execution time and/or memory usage. Both posts center on the idea that "under the hood" the apply functions are loops, the difference being that a for loop runs the loop in R, whereas the apply functions loop in C. That makes vapply-ing primitive functions like sum, which "make no use of R code", particularly interesting: vapply(x, sum, numeric(1)) should show a clear advantage since it is executed entirely in C. Sure enough, there is a pretty big speed up from passing sum() to an apply function directly versus calling it inside an R wrapper. (Note: all expr outputs from microbenchmark are removed for readability.)
> library(microbenchmark)
> set.seed(100)
> mydata <- as.data.frame(matrix(runif(5000000), ncol = 10000))
> microbenchmark(vapply(mydata, sum, numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
10.79918 11.16559 12.33968 11.45679 11.67717 34.89015 100
> microbenchmark(vapply(mydata, function(x) sum(x), numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
13.92961 14.20497 15.62257 14.33341 14.64748 32.71982 100
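(For reference, the R-level for loop those posts have in mind would look roughly like the sketch below; the col_sums_loop name is mine and its timings are not included above.)

# sketch: the R-level loop the posts compare against,
# iterating over the columns of mydata in R itself
col_sums_loop <- function(df) {
  out <- numeric(length(df))
  for (i in seq_along(df)) {
    out[i] <- sum(df[[i]])
  }
  out
}
microbenchmark(col_sums_loop(mydata))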
This got me wondering: when is it better to call an apply function multiple times with primitive functions, and when is it better to call it once with an R wrapper? For instance, to compute means:
> microbenchmark(vapply(mydata, mean, numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
88.90804 96.81667 112.2362 108.323 117.7397 585.6286 100
> microbenchmark(colMeans(mydata))
Unit: milliseconds
min lq mean median uq max neval
53.13873 57.16753 64.98141 59.10488 73.67025 100.2248 100
> microbenchmark(vapply(mydata, sum, numeric(1))/vapply(mydata, length, numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
15.1812 15.74453 17.59638 16.08719 16.97016 34.12331 100
> microbenchmark(vapply(mydata, function(x) sum(x)/length(x), numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
16.09877 17.24127 20.73641 18.23274 20.59589 58.83919 100
I assume the longer run times for mean() and colMeans() have to do with argument checking rather than with them not being primitives (correct me if I am wrong). (@Roland corrected me: for colMeans() it is because it must first convert the data.frame to a matrix.)
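(A quick way to check how much of colMeans()'s time is that conversion would be to benchmark it against a pre-converted matrix; I have not run this, so no timings are shown, and mymat is just a name for the sketch.)

# isolate the data.frame -> matrix conversion from colMeans() itself
mymat <- as.matrix(mydata)
microbenchmark(colMeans(mydata),  # converts to a matrix on every call
               colMeans(mymat))   # conversion already paid for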
The speed up from calling the primitives sum() and length() independently rather than as function(x) sum(x)/length(x) was more interesting. The difference isn't big, but it is consistently there across the microbenchmark distribution. This made me wonder about calling anonymous functions that mix primitive and non-primitive functions, so I ran:
> microbenchmark(vapply(mydata, function(x) sum(x)/sd(x), numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
161.7666 207.5839 220.5956 213.863 231.5117 623.3246 100
> microbenchmark(vapply(mydata, sum, numeric(1))/vapply(mydata, sd, numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
164.1759 195.0166 212.6117 207.3877 219.0822 640.5814 100
Once again, not a huge difference, but an improvement is there in all but the min and max of the distribution. Finally, I decided to do one last experiment where I called two non-primitives:
> microbenchmark(vapply(mydata, function(x) mean(x)/sd(x), numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
290.6148 337.9683 358.855 356.099 368.3152 840.7395 100
> microbenchmark(vapply(mydata, mean, numeric(1))/vapply(mydata, sd, numeric(1)))
Unit: milliseconds
min lq mean median uq max neval
246.9073 287.1489 303.7743 300.0357 319.8356 400.5652 100
This one really surprised me: since both mean() and sd() are R functions, I would have expected looping over the columns once and calling both R functions to beat looping over them twice, yet the two-pass version came out faster.
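(One follow-up I have not run: mean() is a generic that dispatches to mean.default(), an R-level method that does argument checking before reaching .Internal(), so timing the method directly would show how much of mean()'s cost is dispatch versus those checks.)

# how much of mean()'s cost is S3 dispatch vs. its R-level checks?
microbenchmark(vapply(mydata, mean, numeric(1)),         # generic, dispatches to mean.default
               vapply(mydata, mean.default, numeric(1))) # call the method directly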
So for my questions:
- Why am I getting faster run times from calling vapply() twice on simpler functions than from calling it once on a slightly more complex one?
- In each of these cases, where is the speed up coming from: avoiding R functions by calling the apply functions directly on primitives, or splitting the function up into multiple apply calls?
- Should I always split my function calls into multiple simpler apply calls rather than one complex one (even if it means 7 calls to an apply function), and what makes a function "simple" enough to be worth it? (A quick way to check which of the functions above are primitives is sketched below.)
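For reference, is.primitive() reports which of the functions used above are implemented entirely in C with no R-level code:

# which of the functions used above are primitives (no R-level code)?
is.primitive(sum)     # TRUE
is.primitive(length)  # TRUE
is.primitive(mean)    # FALSE -- a generic whose default method is R code
is.primitive(sd)      # FALSE -- plain R code wrapping var()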