I have large data sets with rows that measure the same thing (essentially duplicates with some noise). As part of a larger function I am writing, I want the user to be able to collapse these rows with a function of their choosing (e.g. mean, median).
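For context, a stripped-down sketch of the kind of wrapper I have in mind (collapseRows and the hard-coded id grouping are placeholders, not my real function):

collapseRows <- function(dt, aggFn = "mean") {
  # aggFn comes in as a character string chosen by the user (e.g. "mean", "median"),
  # hence the match.fun() lookup rather than calling the function directly
  dt[, lapply(.SD, match.fun(aggFn)), by = id]
}

The point is just that aggFn arrives as a string, so I can't hard-code the call the way the first system.time() line in the MWE below does.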
My problem is that calling the function directly is much faster than going through match.fun (which is what I need, since the user supplies the function by name). MWE:
require(data.table)
rows <- 100000
cols <- 1000
dat <- data.table(id=sample(LETTERS, rows, replace=TRUE),
                  matrix(rnorm(rows*cols), nrow=rows))
aggFn <- "median"
system.time(dat[, lapply(.SD, median), by=id])
system.time(dat[, lapply(.SD, match.fun(aggFn)), by=id])
On my system, the timing results for the last two lines are:

# dat[, lapply(.SD, median), by=id]
   user  system elapsed
  1.112   0.027   1.141

# dat[, lapply(.SD, match.fun(aggFn)), by=id]
   user  system elapsed
  2.854   0.265   3.121
The gap becomes even more dramatic with larger data sets.
As a final point, I realize aggregate() can do this (and doesn't seem to suffer from the same slowdown), but I need to work with data.table objects because of the data size.
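For reference, this is roughly the aggregate() call I was thinking of; it works fine on a small subset, but it is not practical at this scale:

# roughly what I mean by "aggregate() can do this"; too slow/heavy on the full data
aggregate(. ~ id, data = as.data.frame(dat), FUN = match.fun(aggFn))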