4

Take the following simple function:

fun <- function(a, b, c, d, e) {
   
   stopifnot("Input you provide must be equal length." = length(a) == length(b) && length(b) == length(c) && length(c) == length(d) && length(d) == length(e))

   result <- (a + b / c + d) / sqrt(e)
   
   result2 <- a/result

   return(data.frame(result = result, result2 = result2, a = a, b = b, c = c, d = d, e = e))
}

Now, if I want to map over a look-up table of all combinations of input values, I could do the following, e.g., using purrr functionals:

library(purrr)

df <- expand.grid(a = 1:1000, b = c(1, 2, 3, 4, 5), c = 7, d = 3, e = 5)

out <- pmap_df(d, fun)

However, even for the relatively simple case of one larger and a smaller vector (in my application case, this would be the most common case though), this is pretty slow.

Unit: seconds                                                                             
      min       lq     mean   median       uq      max neval
 2.235245 2.235245 2.235245 2.235245 2.235245 2.235245     1

How to speed this up, especially for the simple case sketched above? Of course, as df gets larger and larger, things will slow down.

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
Rob G.
  • 395
  • 1
  • 8
  • 1
    `pmap_df(d, function(a, b, c, d, e) fun(a = a, b = b, c = c, d = d, e = e))` is more conventionally written as `pmap_df(df, fun)`. In a similar vein, [the `return` function call is unnecessary](https://stackoverflow.com/a/59090751/1968). – Konrad Rudolph Aug 09 '21 at 12:46
  • Fully agree on the first part. I pasted stuff from a situation in my script where it was ok. The second part, however, is pretty subjective (although you made a nice case). There are also lots of OK reasons to just use return (e.g., ease readability for non-r-native co-authors). – Rob G. Aug 09 '21 at 19:41
  • I would caution against “dumbing down” code for non-native “speakers”. There’s simply no substitute for learning the language. Sure, code shouldn’t be “clever”, and it should be self-documenting and clear. But it also should be idiomatic, and play to the strengths of the language it’s written in. – Konrad Rudolph Aug 09 '21 at 20:52

2 Answers2

9

I cannot say my solution is fastest, but it is faster indeed. You can try the code below

do.call(fun, df)

and the benchmarking

df <- expand.grid(a = 1:1000, b = c(1, 2, 3, 4, 5), c = 7, d = 3, e = 5)


f_Rob <- function() pmap_df(df, function(a, b, c, d, e) fun(a = a, b = b, c = c, d = d, e = e))
f_TIC <- function() do.call(fun, df)

microbenchmark(
  f_Rob(),
  f_TIC(),
  unit = "relative",
  check = "equivalent",
  times = 10
)

and you will see

Unit: relative
    expr      min       lq     mean   median       uq      max neval
 f_Rob() 1074.886 1049.034 441.6319 854.2739 620.4029 92.29739    10
 f_TIC()    1.000    1.000   1.0000   1.0000   1.0000  1.00000    10
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • Could you add a little explainer as to why this is so much faster than `pmap_df(df, fun)`? I guess "something, something, R's internal copyshop"? Still interested in hearing other approaches. – Rob G. Aug 09 '21 at 12:53
  • @RobG. Sorry that I didn't use `pmap_df` before. but I think your `fun` is already vectorized. What you need to do is just push all the argument to the function. – ThomasIsCoding Aug 09 '21 at 12:55
  • No problem, yes, makes sense! – Rob G. Aug 09 '21 at 12:57
  • 1
    @Adam yes, I used `df` instead of `d` to avoid naming collisions. – ThomasIsCoding Aug 09 '21 at 13:01
  • 1
    @Adam changed it. I think I used pmap instead bc some part in my use case is not vectorized. But this definitely helped me to think through it more clearly/showed me where to vectorize further! – Rob G. Aug 09 '21 at 13:12
4

I think the most direct tidyverse equivalent to this would be to use exec() from rlang.

This is not quicker than do.call() and I don't see clear advantages outside of advanced cases, but here it is.

library(rlang)

df <- expand.grid(a = 1:1000, b = c(1, 2, 3, 4, 5), c = 7, d = 3, e = 5)

f_TIC <- function() do.call(fun, df)
f_rlang <- function() exec(fun, !!!df)

microbenchmark::microbenchmark(
  f_rlang(),
  f_TIC(),
  unit = "relative",
  check = "equivalent",
  times = 100
)

About 15% slower.

Unit: relative
      expr      min       lq     mean   median       uq      max neval
 f_rlang() 1.158271 1.149351 1.156371 1.145274 1.143179 1.229871   100
   f_TIC() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100