3

I am writing a custom aggregation function with data.table (v 1.9.6) and struggle to pass function arguments to it. there have been similar questions on this but none deals with multiple (variable) inputs and none seems to have a conclusive answer but rather "little hacks".

  1. pass variables and names to data.table function
  2. eval and quote in data.table
  3. How can one work fully generically in data.table in R with column names in variables

I would like to take a data table sum and order defined variables and create new variables on top (2 steps). the crucial think is that everything should be parameterized i.e. variables to sum, variables to group by, variables to order by. and they can all be one or more variables. a small example.

dt <- data.table(a=rep(letters[1:4], 5), 
                 b=rep(letters[5:8], 5),
                 c=rep(letters[3:6], 5),
                 x=sample(1:100, 20),
                 y=sample(1:100, 20),
                 z=sample(1:100, 20))

temp <- 
  dt[, .(x_sum = sum(x, na.rm = T),
         y_sum = sum(y, na.rm = T)),
     by = .(a, b)][order(a, b)]

temp2 <- 
  temp[, `:=` (x_sum_del = (x_sum - shift(x = x_sum, n = 1, type = "lag")),
               y_sum_del = (y_sum - shift(x = y_sum, n = 1, type = "lag")),
               x_sum_del_rel = ((x_sum - shift(x = x_sum, n = 1, type = "lag")) /
                                  (shift(x = x_sum, n = 1, type = "lag"))),
               y_sum_del_rel = ((y_sum - shift(x = y_sum, n = 1, type = "lag")) /
                                  (shift(x = y_sum, n = 1, type = "lag")))
               )
       ]

how to programmatically pass following function arguments (i.e. not single inputs but vectors/list of inputs):

  • x and y --> var_list
  • new names of x and y (e.g. x_sum, y_sum) --> var_name_list
  • group by arguments a, b --> by_var_list
  • order by arguments a, b --> order_var_list
  • temp 2 should work on all pre-defined parameters, I was also thinking about using an apply function but again struggled to pass a list of variables.

I have played around with variations of get(), as.name(), eval(), quote() but as soon as I pass more than one variable, they don't work anymore. I hope the question is clear, otherwise I am happy to adjust where you deem necessary. a function call would look as follows:

fn_agg(dt, var_list, var_name_list, by_var_list, order_var_list)
Community
  • 1
  • 1
Triamus
  • 2,415
  • 5
  • 27
  • 37
  • how does mget help with variable input length? e.g. imagine I would like to sum (x, y, z) rather than only (x, y). then table temp will not work as it currently only takes static input of two vars. – Triamus Aug 24 '16 at 12:58

2 Answers2

3

Looks like a question to me :)
I prefer computing on the language over get/mget.

fn_agg = function(dt, var_list, var_name_list, by_var_list, order_var_list) {
    j_call = as.call(c(
        as.name("."),
        sapply(setNames(var_list, var_name_list), function(var) as.call(list(as.name("sum"), as.name(var), na.rm=TRUE)), simplify=FALSE)
    ))
    order_call = as.call(c(
        as.name("order"),
        lapply(order_var_list, as.name)
    ))
    j2_call = as.call(c(
        as.name(":="),
        c(
            sapply(setNames(var_name_list, paste0(var_name_list,"_del")), function(var) {
                substitute(.var - shift(x = .var, n = 1, type = "lag"), list(.var=as.name(var)))
            }, simplify=FALSE),
            sapply(setNames(var_name_list, paste0(var_name_list,"_del_rel")), function(var) {
                substitute((.var - shift(x = .var, n = 1, type = "lag")) / (shift(x = .var, n = 1, type = "lag")), list(.var=as.name(var)))
            }, simplify=FALSE)
        )
    ))
    dt[eval(order_call), eval(j_call), by=by_var_list
       ][, eval(j2_call)
         ][]
}

ans = fn_agg(dt, var_list=c("x","y"), var_name_list=c("x_sum","y_sum"), by_var_list=c("a","b"), order_var_list=c("a","b"))
all.equal(temp2, ans)
#[1] TRUE

Some extra notes:

  1. make strict input validation as debugging issues is more difficuilt against meta programming.
  2. optimization of step2 is possible as shift is computed multiple times, easy way is just to compute _del in step2 and _del_rel in step3.
  3. if order variables is always the same as by variables you can put them into keyby argument.
jangorecki
  • 16,384
  • 4
  • 79
  • 160
  • I will need some time to digest this and apply to my much bigger problem. I was actually trying to find hints on implementation @https://github.com/jangorecki/dwtools/ but failed. – Triamus Aug 24 '16 at 13:19
  • I have accepted @docendo discimus answer as it comes more natural to my thinking but it seems your answer is more generic and gives optimization potential. I will try to get my head around it and apply it to my problem. many thanks for extensive answer! – Triamus Aug 24 '16 at 16:02
1

Here's an option using mget, as commented:

fn_agg <- function(DT, var_list, var_name_list, by_var_list, order_var_list) {

  temp <- DT[, setNames(lapply(.SD, sum, na.rm = TRUE), var_name_list), 
             by = by_var_list, .SDcols = var_list]

  setorderv(temp, order_var_list)

  cols1 <- paste0(var_name_list, "_del")
  cols2 <- paste0(cols1, "_rel")

  temp[, (cols1) := lapply(mget(var_name_list), function(x) {
    x - shift(x, n = 1, type = "lag")
  })]

  temp[, (cols2) := lapply(mget(var_name_list), function(x) {
    xshift <- shift(x, n = 1, type = "lag")
    (x - xshift) / xshift
  })]

  temp[]
}

fn_agg(dt, 
       var_list = c("x", "y"), 
       var_name_list = c("x_sum", "y_sum"), 
       by_var_list = c("a", "b"), 
       order_var_list = c("a", "b"))

#   a b x_sum y_sum x_sum_del y_sum_del x_sum_del_rel y_sum_del_rel
#1: a e   254   358        NA        NA            NA            NA
#2: b f   246   116        -8      -242  -0.031496063    -0.6759777
#3: c g   272   242        26       126   0.105691057     1.0862069
#4: d h   273   194         1       -48   0.003676471    -0.1983471

Instead of mget, you could also make use of data.table's .SDcols argument as in

temp[, (cols1) := lapply(.SD, function(x) {
    x - shift(x, n = 1, type = "lag")
  }), .SDcols = var_name_list]

Also, there are probably ways to improve the function by avoiding duplicated computation of shift(x, n = 1, type = "lag") but I only wanted to demonstrate a way to use data.table in functions.

talat
  • 68,970
  • 21
  • 126
  • 157
  • I will need some time to digest this and compare against @jangorecki answer. but your implementation comes close to the things I have tried but I was missing some parts. agree on the point around efficiency for lag function but that was my smallest concern :-) – Triamus Aug 24 '16 at 13:23