
I am exploring different ways to wrap an aggregation function (though it could really be any type of function) using data.table (one dplyr example is also provided), and I am wondering about best practices for functional programming / metaprogramming with respect to

  • performance (does the implementation matter with respect to potential optimization that data.table may apply)
  • readability (is there a commonly agreed standard e.g. in most packages utilizing data.table)
  • ease of generalization (are there differences in the way metaprogramming is "generalizable")

The basic application is to aggregate a table flexibly, i.e. to parameterize the variables to aggregate, the dimensions to aggregate by, the resulting variable names of both, and the aggregation function. I have implemented (nearly) the same function in three data.table ways and one dplyr way:

  1. fn_dt_agg1 (here I couldn't figure out how to parameterize the aggregation function)
  2. fn_dt_agg2 (inspired by @jangorecki's answer here, which he calls "computing on the language")
  3. fn_dt_agg3 (inspired by @Arun's answer here, which seems to be another approach to metaprogramming)
  4. fn_df_agg1 (my humble approach of the same in dplyr)

libraries

library(data.table)
library(dplyr)

data

n_size <- 1*10^6
sample_metrics <- sample(seq(from = 1, to = 100, by = 1), n_size, replace = TRUE)
sample_dimensions <- sample(letters[10:12], n_size, replace = TRUE)
df <- 
  data.frame(
    a = sample_metrics,
    b = sample_metrics,
    c = sample_dimensions,
    d = sample_dimensions,
    x = sample_metrics,
    y = sample_dimensions,
    stringsAsFactors = FALSE)

dt <- as.data.table(df)

implementations

1. fn_dt_agg1

fn_dt_agg1 <- 
  function(dt, metric, metric_name, dimension, dimension_name) {

  temp <- dt[, setNames(lapply(.SD, function(x) {sum(x, na.rm = TRUE)}), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]
  temp[]
}

res_dt1 <- 
  fn_dt_agg1(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"))

2. fn_dt_agg2

fn_dt_agg2 <- 
  function(dt, metric, metric_name, dimension, dimension_name,
           agg_type) {

  j_call = as.call(c(
    as.name("."),
    sapply(setNames(metric, metric_name), 
           function(var) as.call(list(as.name(agg_type), 
                                      as.name(var), na.rm = TRUE)), 
           simplify = FALSE)
    ))

  dt[, eval(j_call), keyby = dimension][]
}

res_dt2 <- 
  fn_dt_agg2(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = c("sum"))

all.equal(res_dt1, res_dt2)
#TRUE
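
To make the "computing on the language" approach in fn_dt_agg2 more tangible, the following sketch builds the same kind of j expression with base R only, so that the call can be inspected before data.table evaluates it. The output names total_a and total_b are hypothetical and only show where metric_name ends up; the final eval() against a plain list stands in for the evaluation data.table would do per group.

```r
# Inputs as they would be passed to fn_dt_agg2
metric      <- c("a", "b")
metric_name <- c("total_a", "total_b")  # hypothetical output names
agg_type    <- "sum"

# Build one call per metric, e.g. sum(a, na.rm = TRUE), and wrap them in .()
j_call <- as.call(c(
  as.name("."),
  sapply(setNames(metric, metric_name),
         function(var) as.call(list(as.name(agg_type),
                                    as.name(var), na.rm = TRUE)),
         simplify = FALSE)
))

# The call is an ordinary R object that can be inspected piece by piece
call_names <- names(j_call)   # "", "total_a", "total_b"
first_agg  <- j_call[[2]]     # the unevaluated call sum(a, na.rm = TRUE)

# Evaluating one piece against a plain list, as a stand-in for a data.table group
first_value <- eval(first_agg, list(a = c(1, 2, NA)))
```

Because the expression is assembled from names and calls rather than from pasted text, a typo in agg_type or metric surfaces as an ordinary "could not find function" or "object not found" error at evaluation time rather than as a parse error.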

3. fn_dt_agg3

fn_dt_agg3 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

  e <- eval(parse(text=paste0("function(x) {", 
                              agg_type, "(", "x, na.rm = TRUE)}"))) 

  temp <- dt[, setNames(lapply(.SD, e), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]
  temp[]
}

res_dt3 <- 
  fn_dt_agg3(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = "sum")

all.equal(res_dt1, res_dt3)
#TRUE

4. fn_df_agg1

fn_df_agg1 <-
  function(df, metric, metric_name, dimension, dimension_name, agg_type) {

    all_vars <- c(dimension, metric)
    all_vars_new <- c(dimension_name, metric_name)
    dots_group <- lapply(dimension, as.name)

    e <- eval(parse(text=paste0("function(x) {", 
                                agg_type, "(", "x, na.rm = TRUE)}")))

    df %>%
      select_(.dots = all_vars) %>%
      group_by_(.dots = dots_group) %>%
      summarise_each_(funs(e), metric) %>%
      rename_(.dots = setNames(all_vars, all_vars_new))
}

res_df1 <- 
  fn_df_agg1(
    df = df, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = "sum")

all.equal(res_dt1, as.data.table(res_df1))
#"Datasets has different keys. 'target': c, d. 'current' has no key."

benchmarking

Out of curiosity, and for my future self and other interested parties, I ran a benchmark of all four implementations, which may already shed some light on the performance question (I'm not a benchmarking expert, so please excuse me if I haven't applied commonly agreed best practices). I expected fn_dt_agg1 to be the fastest since it has one parameter fewer (the aggregation function), but that doesn't seem to have a sizable impact. I was also surprised by the relatively slow dplyr function, but this may be due to a bad design choice on my end.

library(microbenchmark)
bench_res <- 
  microbenchmark(
    fn_dt_agg1 = 
      fn_dt_agg1(
      dt = dt, metric = c("a", "b"), 
      metric_name = c("a", "b"), 
      dimension = c("c", "d"), 
      dimension_name = c("c", "d")), 
    fn_dt_agg2 = 
      fn_dt_agg2(
        dt = dt, metric = c("a", "b"), 
        metric_name = c("a", "b"), 
        dimension = c("c", "d"), 
        dimension_name = c("c", "d"),
        agg_type = c("sum")),
    fn_dt_agg3 =
      fn_dt_agg3(
        dt = dt, metric = c("a", "b"), 
        metric_name = c("a", "b"),
        dimension = c("c", "d"), 
        dimension_name = c("c", "d"),
        agg_type = c("sum")),
    fn_df_agg1 =
      fn_df_agg1(
        df = df, metric = c("a", "b"), metric_name = c("a", "b"),
        dimension = c("c", "d"), dimension_name = c("c", "d"),
        agg_type = "sum"),
    times = 100L)

bench_res

# Unit: milliseconds
#       expr      min       lq     mean   median       uq       max neval
# fn_dt_agg1 28.96324 30.49507 35.60988 32.62860 37.43578 140.32975   100
# fn_dt_agg2 27.51993 28.41329 31.80023 28.93523 33.17064  84.56375   100
# fn_dt_agg3 25.46765 26.04711 30.11860 26.64817 30.28980 153.09715   100
# fn_df_agg1 88.33516 90.23776 97.84826 94.28843 97.97154 172.87838   100

other resources

Triamus
  • re: agg2 "which he calls 'computing on the language'" - not me but official R lang definition which you linked at the bottom. – jangorecki Dec 29 '16 at 17:39
  • @Triamus You may check [data.table v1.14.1 `devel`](https://github.com/Rdatatable/data.table/blob/master/NEWS.md#new-features), item 10: "A new interface for programming on data.table has been added" – Henrik Jul 31 '21 at 05:22

1 Answer


I don't recommend eval(parse()). You can achieve the same result as in the third approach (fn_dt_agg3) without it:

fn_dt_agg4 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

    e <- function(x) getFunction(agg_type)(x, na.rm = TRUE)

    temp <- dt[, setNames(lapply(.SD, e), 
                          metric_name), 
               keyby = dimension, .SDcols = metric]
    temp[]
  }

This also avoids some security risks, since agg_type is only looked up as a function name and is never parsed and evaluated as code.
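
To illustrate the security point, here is a minimal sketch comparing the two ways of turning a string into a function. The "malicious" agg_type string is a made-up, harmless stand-in; the point is only that eval(parse()) runs whatever code the string contains, while a lookup by name fails for anything that is not the name of an existing function.

```r
# Hypothetical hostile input: not a function name, but code that defines one
agg_type <- "(function(...) 'arbitrary code ran')"

# fn_dt_agg3 style: the string is pasted into source code and evaluated
e_parsed <- eval(parse(text = paste0("function(x) {",
                                     agg_type, "(x, na.rm = TRUE)}")))
parsed_result <- e_parsed(1:3)  # runs the injected code instead of aggregating

# fn_dt_agg4 style: the string is only used to look up an existing function
lookup_result <- tryCatch(
  methods::getFunction(agg_type),
  error = function(cond) "rejected"  # no function of that name exists
)

# A legitimate name still works with the lookup approach
sum_result <- methods::getFunction("sum")(c(1, 2, NA), na.rm = TRUE)
```

A lookup does not validate that the named function is a sensible aggregation, so a wrapper exposed to untrusted input would probably still want to check agg_type against an explicit list of allowed names.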

PS: You can check what data.table is doing regarding optimizations by setting options("datatable.verbose" = TRUE).
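
As a rough sketch of how the verbose option can be used for the performance question, the snippet below runs the same grouped sum twice on a tiny made-up table (dt_small and grp are hypothetical names), once passing sum directly to lapply and once wrapped in an anonymous function as in fn_dt_agg1. With verbose output switched on, data.table reports for each query whether and how it rewrote j; my understanding is that the wrapped variant is less likely to be handed to the internal grouped-sum optimization, but the printed messages are the thing to check rather than this comment.

```r
library(data.table)

# Tiny illustrative table; 'grp' is a hypothetical grouping column
dt_small <- data.table(a = c(1, 2, 3, 4), grp = c("j", "j", "k", "k"))

# Turn on verbose output and remember the previous setting
old_opts <- options(datatable.verbose = TRUE)

# Variant 1: the aggregation function is passed to lapply directly
res_direct <- dt_small[, lapply(.SD, sum, na.rm = TRUE),
                       keyby = "grp", .SDcols = "a"]

# Variant 2: the aggregation is wrapped in an anonymous function
res_wrapped <- dt_small[, lapply(.SD, function(x) sum(x, na.rm = TRUE)),
                        keyby = "grp", .SDcols = "a"]

# Restore the previous setting
options(old_opts)
```

Both variants return the same result; only the messages printed while they run differ, and on a table as small as this one any timing difference is not meaningful.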

Roland
  • Is there an important difference between `getFunction` and `match.fun`? – Axeman Dec 29 '16 at 09:19
  • Nice. I didn't know about `getFunction`; I haven't seen it anywhere else so far. But why would `eval(parse())` not be recommended? I had seen it in other answers from @Matt Dowle [here](http://stackoverflow.com/questions/10675182/in-r-data-table-how-do-i-pass-variable-parameters-to-an-expression) and @Arun [here](http://stackoverflow.com/questions/26883859/using-eval-in-data-table?rq=1) – Triamus Dec 29 '16 at 09:25
  • @Axeman I don't know. The latter allows input other than characters. – Roland Dec 29 '16 at 09:27
  • @Triamus In the first post it isn't from Matt, in the second post Arun refers to the question which uses it. R allows computing on the language, so you don't need it. Eval/parsing arbitrary expressions adds security risks, can be slow (not in the example here), and is impossible to debug. – Roland Dec 29 '16 at 09:29
  • @Roland I see. So you would argue that fn_dt_agg4 is true computation on the language while fn_dt_agg3 is not? Do you have an opinion on fn_dt_agg2? – Triamus Dec 29 '16 at 09:35
  • @Roland In Matt's edit, he mentioned: "If on the other hand many queries are being constructed dynamically and repeated then the time to paste0 and parse might come into it." That didn't seem like strong advice against using it. – Triamus Dec 29 '16 at 09:40
  • Yes, eval/parsing is not computing on the language. If you consider using `parse` reread fortune 106. There are [valid uses](http://stackoverflow.com/a/40612539/1412059) of `parse`, but they are rare. – Roland Dec 29 '16 at 09:50