
I am passing some data.tables to a function and want to collect the growing results in the passed data.tables over multiple function calls. The rows are added (appended) within the function.

Is there a way to append rows to a data.table "by reference/inplace"?

Any workaround if this is not possible?

Edit: My goal is adding multiple rows at once within the function and the number of rows can be very big (that's why I am using a 'data.table').

library(data.table)

validate <- function(data, rule, valid.result, checked.rules) {
  # ... find errors

  # How to append "rule" to "checked.rules"?

  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")  # just a stupid example
  # How to append all "findings" to "valid.result"?
}

data          <- data.table(a=1:10, b=21:30)
valid.result  <- data.table(err.code = integer(0), msg       = character(0))  # empty validation results table
checked.rules <- data.table(rule.id  = integer(0), rule.name = character(0))  # empty table
rules         <- data.table(rule.id  = 1:4,        rule.name = c("too big", "too small", "too late", "empty"))

validate(data, rules[3, ], valid.result, checked.rules)
validate(data, rules[1, ], valid.result, checked.rules)
validate(data, rules[4, ], valid.result, checked.rules)

Expected results:

checked.rules
#    rule.id rule.name
# 1:       3  too late
# 2:       1   too big
# 3:       4     empty

valid.result
#    err.code            msg
# 1:        3 some blah blah
# 2:        1 some blah blah
# 3:        4 some blah blah
R Yoda
  • Related: [Add a row by reference at the end of a data.table object](https://stackoverflow.com/questions/16792001/add-a-row-by-reference-at-the-end-of-a-data-table-object) – Henrik Feb 01 '19 at 17:23
  • Good link! What I want to do is add not just one row but multiple rows to big `data.table`s (I should have written this in the question). The linked question adds a single row (or one row at a time in a loop), which smells slow. I will try to benchmark this... – R Yoda Feb 01 '19 at 17:28
  • See also: [4 different methods of appending multiple rows in place when the number of rows is not known in advance](http://stackoverflow.com/questions/20689650/how-to-append-rows-to-an-r-data-frame/38052208#38052208) – R Yoda Feb 01 '19 at 23:13

2 Answers


As already mentioned in the question linked by @Henrik, data.table currently cannot add rows by reference. I'd therefore go with rbindlist, which also works just fine for adding multiple rows at once:

library(data.table)

validate <- function(data, rule, valid.result, checked.rules) {
  # ... find errors

  # How to append "rule" to "checked.rules"?
  checked.rules <<- rbindlist(list(checked.rules, rule))

  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")  # just a stupid example
  # How to append all "findings" to "valid.result"?
  valid.result <<- rbindlist(list(valid.result, findings))
}

data          <- data.table(a=1:10, b=21:30)
valid.result  <- data.table(err.code = integer(0), msg       = character(0))  # empty validation results table
checked.rules <- data.table(rule.id  = integer(0), rule.name = character(0))  # empty table
rules         <- data.table(rule.id  = 1:4,        rule.name = c("too big", "too small", "too late", "empty"))

validate(data, rules[3, ], valid.result, checked.rules)
validate(data, rules[1, ], valid.result, checked.rules)
validate(data, rules[4, ], valid.result, checked.rules)

print(checked.rules)
print(valid.result)
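
To see why a reassignment is needed at all, here is a minimal sketch that compares object addresses with data.table's `address()`. The variable names are made up for illustration; the point is only that `:=` changes the existing table while `rbindlist()` returns a new one.

```r
library(data.table)

dt <- data.table(x = 1:3)
addr.before <- address(dt)

# `:=` adds a column by reference: the table stays at the same address
dt[, y := x * 2L]
same.after.column <- identical(address(dt), addr.before)

# rbindlist() builds a new table; `dt` itself is left untouched
dt.more <- rbindlist(list(dt, data.table(x = 4L, y = 8L)))
same.after.rows <- identical(address(dt.more), addr.before)

print(c(column.added.in.place = same.after.column,
        rows.added.in.place   = same.after.rows))
print(nrow(dt))       # the original table still has its 3 rows
print(nrow(dt.more))  # the new table has the appended row
```

Because the appended rows live in a new object, the result has to be assigned somewhere the caller can see it, which is what `<<-` does in the code above.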
ismirsehregal
  • I am sure this works but relying on a "global" variable that is changed within a function via `<<-` is really my last resort. – R Yoda Feb 01 '19 at 22:44
  • Yes, it would be better to return both tables in a list. This was just a lazy way to get the expected result. – ismirsehregal Feb 01 '19 at 22:51
  • The problem with a `list` is: you can return multiple `data.table`s, but if you pass this `list` into the next function call and update it inside the function (via `rbindlist`), the update is not visible outside the function because of copy-on-write. Returning the new list means storing the return value and passing it into the next function call, which would work of course but clutters the code with a return value variable. OTOH that would then be a functional solution... – R Yoda Feb 01 '19 at 23:07
  • Yes, you are right. I guess for readability it would be better to split things up into two functions and assign the results to each data.table. – ismirsehregal Feb 01 '19 at 23:41

After reading the links in the comments and the proposal of @ismirsehregal to use a list, I ended up using an environment so that I can collect multiple results "by reference".
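
Applied to the `validate()` example from the question, the environment approach could look roughly like the sketch below. The three-argument signature and the member names inside `results` are assumptions made for illustration, not part of the original code.

```r
library(data.table)

# Hypothetical variant of validate(): `results` is an environment, so
# assignments to its members are visible to the caller.
validate <- function(data, rule, results) {
  # ... find errors
  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")
  results$checked.rules <- rbindlist(list(results$checked.rules, rule))
  results$valid.result  <- rbindlist(list(results$valid.result, findings))
  invisible(NULL)
}

data  <- data.table(a = 1:10, b = 21:30)
rules <- data.table(rule.id   = 1:4,
                    rule.name = c("too big", "too small", "too late", "empty"))

results <- new.env()
results$valid.result  <- data.table(err.code = integer(0), msg = character(0))
results$checked.rules <- data.table(rule.id = integer(0), rule.name = character(0))

for (i in c(3, 1, 4)) validate(data, rules[i, ], results)

print(results$checked.rules)
print(results$valid.result)
```

The tables are still rebuilt by `rbindlist()` on every call; the environment only makes the new tables visible outside the function without `<<-` or a return value.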

I have done a benchmark for two variants:

  1. rbind the intermediate result at the end of each function call into the "cumulative" result ("append within the function").

  2. collect the intermediate results for each function call and rbindlist only once at the end ("append outside of the function").

The code is simplified and produces about 9 million rows after 20 function calls:

library(data.table)
library(microbenchmark)

validate.rbind <- function(data, results) {
  findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(1E6, 1) + 1))  # just a stupid example
  results$valid.result <- rbind(results$valid.result, findings) # same as: rbindlist(list(results$valid.result, findings))
}

validate.rbindlist <- function(data, results) {
  findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(1E6, 1) + 1))  # just a stupid example
  # store the intermediate result under a numbered name ("res01", "res02", ...)
  assign(sprintf("res%02d", results$counter), findings, envir = results)
  results$counter <- results$counter + 1
}

microbenchmark(
  rbind.per.call = {
    set.seed(0815)   # make random numbers reproducible
    data                 <- data.table(a=1:100, b=21:30)
    results              <- new.env()   # use an environment to pass arguments by reference
    results$valid.result <- data.table(err.code = integer(0), msg = character(0))  # empty validation results table
    for (i in 1:20) {
      validate.rbind(data, results)
    }
  },
  rbindlist.once = {
    set.seed(0815)   # make random numbers reproducible
    data                 <- data.table(a=1:100, b=21:30)
    results              <- new.env()   # use an environment to pass arguments by reference
    results$counter      <- 1
    for (i in 1:20) {
      validate.rbindlist(data, results)
    }
    result.vars <- ls(envir = results, pattern = "^res.*")  # identify the result tables via the used naming pattern
    results$valid.result <- rbindlist(mget(result.vars, envir = results))
    rm(list = result.vars, envir = results)  # remove the intermediate result tables (keep only the total result)
  },
  times = 10)

Solution 2 is about four times faster (median of 260 ms vs. 1154 ms):

Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval
 rbind.per.call 1021.2956 1114.8187 1198.7033 1153.7775 1324.6672 1477.5669    10
 rbindlist.once  231.0477  249.7195  305.0974  260.2499  275.3446  713.1155    10

and its peak memory footprint (the "max used" columns of gc()) is lower as well, at about 298 Mb vs. 399 Mb for Vcells:

# Memory consumption for rbind.per.call:
#            used (Mb)  gc trigger  (Mb) max used  (Mb)
# Ncells   510152  27.3     940480  50.3   847768  45.3
# Vcells 19636460 149.9   55027624 419.9 52254173 398.7

# Memory consumption for rbindlist.once:
#            used (Mb)  gc trigger  (Mb) max used  (Mb)
# Ncells   604335  32.3    1168576  62.5   940480  50.3
# Vcells 19859703 151.6   55503896 423.5 39082073 298.2

PS: I did not test the linked `set` variant since I don't expect better performance from it and because it is more complicated to use.
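
The bookkeeping of solution 2 (counter, naming pattern, cleanup) can be hidden behind a small collector object. The sketch below uses a plain closure; `new.collector()`, `add()` and `result()` are hypothetical names, not part of data.table, and the `<<-` here only touches the closure's private list, not a global variable.

```r
library(data.table)

# Hypothetical helper: collects chunks in a list and binds them
# only when the result is requested.
new.collector <- function() {
  chunks <- list()
  add <- function(dt) {
    chunks[[length(chunks) + 1L]] <<- dt  # modifies the closure's own list
    invisible(NULL)
  }
  result <- function() rbindlist(chunks)  # one single bind at the end
  list(add = add, result = result)
}

validate <- function(data, rule, collector) {
  # ... find errors
  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")
  collector$add(findings)
}

rules <- data.table(rule.id   = 1:4,
                    rule.name = c("too big", "too small", "too late", "empty"))

valid.result <- new.collector()
for (i in c(3, 1, 4)) validate(NULL, rules[i, ], valid.result)

combined <- valid.result$result()
print(combined)
```

Each call only stores a reference to the new chunk, so the cost of copying rows is paid once, when `result()` is called.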

R Yoda
  • Interesting wrap up! So in the end you renounced the approach of appending data within the function as requested initially. – ismirsehregal Feb 04 '19 at 08:18
  • @ismirsehregal Good point, solution 1 works as I asked for (appending data within the function), but solution 2 is surprisingly better (not only regarding performance but also regarding the memory footprint). In practice solution 2 should be encapsulated in an R6 class to hide the internals (even hiding the fact that the "append" is lazy until you query the result) – R Yoda Feb 04 '19 at 08:25