
I have large data sets with rows that measure the same thing (essentially duplicates with some noise). As part of a larger function I am writing, I want the user to be able to collapse these rows with a function of their choosing (e.g. mean, median).

My problem is that calling the function directly is much faster than calling it via match.fun (which is what I need, since the user supplies the function by name). MWE:

require(data.table)

rows <- 100000
cols <- 1000
dat <- data.table(id=sample(LETTERS, rows, replace=TRUE), 
                  matrix(rnorm(rows*cols), nrow=rows))

aggFn <- "median"

system.time(dat[, lapply(.SD, median), by=id])
system.time(dat[, lapply(.SD, match.fun(aggFn)), by=id])

On my system, timing results for the last 2 lines:

   user  system elapsed 
  1.112   0.027   1.141 
   user  system elapsed 
  2.854   0.265   3.121 

This becomes quite dramatic with larger data sets.

As a final point, I realize aggregate() can do this (and doesn't seem to suffer from this behavior), but I need to work with data.table objects due to data size.

Travis Gerke
    why not just do `f = match.fun(aggFn)` outside your loop then `lapply(.SD, f)`. I think it's pretty obvious that `match.fun` will be slower than the function itself. Looking at the code for `match.fun`, it's basically running `get` and ensuring that `aggFn` is indeed a function. If you know that `aggFn` is a function already, there's no need to use `match.fun`. – MichaelChirico Jan 12 '17 at 16:59
    You say this gets quite dramatic for larger data, but I tried `rows=1e3; cols=1e4` and the percent increase in time actually went down relative to your example... Besides Michael's suggestion, there's also `e = substitute(lapply(.SD, aggFn), list(aggFn = "median")); system.time(dat[, eval(e), by=id])`, mentioned in the data.table FAQ. – Frank Jan 12 '17 at 17:00
    Oh, actually, the speed difference is probably also thanks to GForce not being triggered when you're using match.fun. See `?GForce` and try running your queries with `verbose=TRUE`, like `dat[, lapply(.SD, median), by=id, verbose=TRUE]` – Frank Jan 12 '17 at 17:09
  • @MichaelChirico thanks: I actually did try defining the function outside of the loop and got the same timings. The key, as others point out below, is that data.table is implementing gforce for median() which, for this application, seems not to be optimal – Travis Gerke Jan 12 '17 at 18:39
  • Thanks @Frank; interesting that the problem diminishes for more columns relative to rows (unlike my example), but maybe that's to be expected since the operation needs to aggregate rows. Good catch with GForce. – Travis Gerke Jan 12 '17 at 18:41

1 Answer


The reason is the GForce optimization that data.table applies to median. You can see this if you set options(datatable.verbose=TRUE). See help("GForce") for details.
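To see this directly, you can run both variants with verbose output; a minimal sketch (smaller data than the question's, just to show the diagnostic):

```r
library(data.table)

set.seed(1)
dat <- data.table(id = sample(LETTERS, 1e4, replace = TRUE),
                  V1 = rnorm(1e4), V2 = rnorm(1e4))

# The literal function name is visible to the optimizer; the verbose
# output reports that GForce-optimized median (gmedian) is used:
r1 <- dat[, lapply(.SD, median), by = id, verbose = TRUE]

# Via match.fun, the call is opaque to the optimizer, so each group
# is evaluated with ordinary R median:
aggFn <- "median"
r2 <- dat[, lapply(.SD, match.fun(aggFn)), by = id, verbose = TRUE]
```

Both calls return identical results; only the evaluation path (and hence the timing) differs.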

If you compare functions that are not GForce-optimized, the timings are much closer:

fun <- median
aggFn <- "fun"
system.time(dat[, lapply(.SD, fun), by=id])
system.time(dat[, lapply(.SD, match.fun(aggFn)), by=id])

A possible workaround to keep the optimization (if the chosen function happens to be supported) is to evaluate an expression built from the function name, e.g., using the dreaded eval(parse()):

dat[, eval(parse(text = sprintf("lapply(.SD, %s)", aggFn))), by=id]

However, you would lose the small amount of safety that match.fun adds.
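If that safety check matters, one option is to run match.fun first purely for validation and still pass the bare name into the query; a sketch, assuming aggFn names a function that is visible where the query is evaluated:

```r
library(data.table)

set.seed(1)
dat <- data.table(id = sample(LETTERS, 1e4, replace = TRUE),
                  V1 = rnorm(1e4), V2 = rnorm(1e4))

aggFn <- "median"
match.fun(aggFn)  # errors early if aggFn does not name a function

# The bare name reaches data.table, so the optimization can still apply:
res <- dat[, eval(parse(text = sprintf("lapply(.SD, %s)", aggFn))), by = id]
```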

If you have a list of functions the users can choose from, you could do this:

funs <- list(quote(mean), quote(median))
fun <- funs[[1]]  # select the desired function
expr <- bquote(lapply(.SD, .(fun)))
a <- dat[, eval(expr), by = id]
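Building on this, the list of allowed functions can be exposed to users through match.arg, which validates the choice before the bare symbol is spliced in. A sketch; collapseRows and its arguments are hypothetical names, not part of any API:

```r
library(data.table)

set.seed(1)
dat <- data.table(id = sample(LETTERS, 1e4, replace = TRUE),
                  V1 = rnorm(1e4), V2 = rnorm(1e4))

# Hypothetical wrapper: restrict users to a whitelist of aggregation
# functions, then splice the bare symbol into the query expression.
collapseRows <- function(dt, aggFn = c("mean", "median")) {
  aggFn <- match.arg(aggFn)  # errors on anything outside the whitelist
  expr <- bquote(lapply(.SD, .(as.name(aggFn))))
  dt[, eval(expr), by = id]
}

res <- collapseRows(dat, "median")
```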
Roland
  • You could use use substitute instead of parse (not sure if you see that as less bad). Relevant link re eval: https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-faq.html#ok-but-i-dont-know-the-expressions-in-advance.-how-do-i-programatically-pass-them-in – Frank Jan 12 '17 at 17:47
  • great answer @Roland. Glad to learn of GForce and had totally forgotten about eval()! Both solutions worked equally well, with a slight preference for the second due to brevity in the ultimate data.table statement – Travis Gerke Jan 12 '17 at 18:47
  • not sure what your ultimate goal is but this question may be interesting as well: http://stackoverflow.com/questions/41376034/r-data-table-functional-programming-metaprogramming-computing-on-the-languag – Triamus Jan 12 '17 at 21:20
  • @Triam that's an incredible link, and I would advise all future readers of this question to check it. Particularly as the approach I like here is covered by approach 3 there (and then refined a bit in the comments) – Travis Gerke Jan 13 '17 at 03:00