Updates to data.table when passed as a function argument

Question

When I pass a data.table as an argument to a function, I can update that table 'by reference' within the called function and the results are applied to the original object. However if I do something that requires a 'deep copy' (e.g. rbindlist to add rows) the copy exists only in the called function. The original object remains in the calling frame unchanged.

library(data.table)
l1 <- function(a1, action='update'){
  b <- l2(a1, action)
  print('l1')
  print(a1)
}
l2 <- function(a2, action){
  c <- l3(a2, action)
  print('l2')
  print(a2)
}
l3 <- function(a3, action){
  if (action == 'update') a3[, col2 := col + 1]
  if (action == 'append') a3 <- rbindlist(list(a3, data.table(col = c(21, 22))), fill=TRUE)
  if (action == 'forceupdate') assign('DT', 
                                      rbindlist(list(a3, data.table(col = c(21, 22))), fill=TRUE),
                                      envir = parent.frame(3))
  print('l3')
  print(a3)
  a3
}
DT <- data.table(col = c(1, 2, 3))
print(DT)
#>    col
#> 1:   1
#> 2:   2
#> 3:   3
l1(DT, 'update')
#> [1] "l3"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
#> [1] "l2"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
#> [1] "l1"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
print(DT)
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4

l1(DT, 'append')
#> [1] "l3"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
#> 4:  21   NA
#> 5:  22   NA
#> [1] "l2"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
#> [1] "l1"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
print(DT)
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4

l1(DT, 'forceupdate')
#> [1] "l3"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
#> [1] "l2"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
#> [1] "l1"
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
print(DT)
#>    col col2
#> 1:   1    2
#> 2:   2    3
#> 3:   3    4
#> 4:  21   NA
#> 5:  22   NA

^{Created on 2021-04-22 by the reprex package (v1.0.0)}

In this example there is a 3-level stack of function calls. The argument at the first level is passed down the stack and updated in function l3.

Using the data.table update syntax, l3 adds a new column to the object and this results in the original object being changed 'in place' and the results are seen in each level in the calling stack.

However if I add rows, using rbindlist, a copy is made within the frame of l3 and this does not affect the original object, or any view of this in the parent calls.

If I assign the change back to the 'original frame' then it is seen there, but the intervening calls don't see the change.

Is there a way of reflecting the results of this 'deep copy' back up the calling stack?

If assign is the way to go, I would appreciate an example for how to establish the name and environment of the underlying data object so that this assignment can be made without hard coding.

If you always want to use `deep copy`, then just to use `copy(dt)`. — Peace Wang, Apr 22 '21 at 02:09
Thanks Peace. However the problem here is the effect of the deep copy and the frame/environment in which the resulting table ends up. If the copy is done locally then the result cannot be seen by the calling function(s) and if assigned back to the original frame the result cannot be seen by the intermediate functions. I would like it to behave the same was as the update-by-reference. — sch56, Apr 23 '21 at 04:28
I also struggle with this in exactly the situation in which I need to add rows to the data.table, so I'm interested if clever answers show up. Currently I sometimes opt for `<<-` in this situation, which searches progressively upwards into parent environments until it hits global where it will create something. It's imperfect but may meet your needs. — Danielle McCool, Apr 23 '21 at 11:54

score 2 · Answer 1 · answered May 05 '21 at 14:01

(Please don't use this casually in production. :-)

I think that this question is really a duplicate of Insert a row in a data.table, Add a row by reference at the end of a data.table object, and How to delete a row by reference in data.table?. It has been said several times that the ability to add rows by-reference "could" be done, but is not trivial (and has not been done yet). Bottom line, the unicorn data.table::insert function does not exist (yet).

While I dislike the premise of using <<- and assign, here is a *hack* that achieves this ... with huge caveats.

func <- function(.x) {
  nm <- deparse(substitute(.x))
  stopifnot(
    "'.x' must be a whole data.table" =
      nm == make.names(nm)
  )
  env <- parent.frame()
  while (!is.null(env) && !exists(nm, envir=env, inherits=FALSE)) {
    env <- parent.env(env)
  }
  stopifnot(!is.null(env))
  assign(nm, rbindlist(list(.x, .x[1,])), envir=env)
}

Here it is working as intended:

DT <- data.table(col1=1:3,col2=11:13)
DT
#     col1  col2
#    <int> <int>
# 1:     1    11
# 2:     2    12
# 3:     3    13
func(DT)
DT
#     col1  col2
#    <int> <int>
# 1:     1    11
# 2:     2    12
# 3:     3    13
# 4:     1    11

However, if we pass a filtered/subsetted table, we should error:

func(DT[1,])
# Error in func(DT[1, ]) : '.x' must be a whole data.table

Why? Allow me to demonstrate. First, I'm going to modify the func so that we find the original DT, not just the partial:

func <- function(.x) {
  nm <- deparse(substitute(.x))
  nm <- gsub("[[({].*", "", nm) # turns 'DT[1,]' --> 'DT'
  env <- parent.frame()
  # ... everything else here
}

DT <- data.table(col1=1:3,col2=11:13)
func(DT[1,])
DT
#     col1  col2
#    <int> <int>
# 1:     1    11
# 2:     1    11

DT <- data.table(col1=1:3,col2=11:13)
func(DT[,1])
DT
#     col1
#    <int>
# 1:     1
# 2:     2
# 3:     3
# 4:     1

func(copy(DT))
# Error in assign(nm, rbindlist(list(.x, .x[1, ])), envir = env) : 
#   cannot change value of locked binding for 'copy'

All three of those scenarios are plausible in normal R code, where the function might need just a subset of the data. And all three demonstrate the dangers of trying to infer enough about the original object based on the data passed to the function. I'm sure somebody could add some logic to func such that it would catch some of these conditions, but that goes further beyond my level of *hack* than I've already reached with the above.

The point of this exercise was to demonstrate that under ideal circumstances where the caller knows the risks and always passed the whole (not subsetted/filter) data.table, this function will effectively add rows by reference. But it isn't by-ref, as we know, it's just hiding the trickery and memory-bloat behind smoke and mirrors. And losing a bit of data.table's stealthy/awesome efficiency in the process.

(Please don't use this casually in production. :-)

Very useful analysis @r2evans. The `data.table::insert` would fix the issue, and your `func` addresses the last part of my question. However a further limitation of this assign is that it does update the underlying table, but this is not seen by an intermediate function in the call stack, which has passed the table on 'by reference'. — sch56, May 06 '21 at 19:34
True, there are many problems with it. I don't suggest this answer is the preferred or canonical approach that the `data.table`-devs should consider, it's merely a *\*hack\*. An alternate approach would be to pass the `data.table` object via an environment and make all changes/additions within the env. That allows by-reference semantics, but it takes a bit more effort on both the caller and callee. — r2evans, May 06 '21 at 19:44

score 1 · Answer 2 · answered May 05 '21 at 18:05

While you can't add rows by reference to a data.table object, you can allocate extra rows before-hand for future updates by reference. Depending on your actual use case, this might be an approach worth considering.

In the example below, I start by allocating 1000 rows to a DB table, even though the starting DT only has 3 values. I can then run an arbitrary number of "inserts", as long as I don't exceed the original allocation.

## Create a "Data Base" table with an over-allocated number of rows
DB <- data.table(ID = seq_len(1e3),
                 InUse = FALSE,
                 col1 = as.integer(NA),
                 col2 = as.integer(NA), key = "ID")


## Starting Table DT
DT <- data.table(InUse = TRUE,
                 col1 = c(1, 2, 3),
                 col2 = as.integer(NA))

## Add an ID lookup key
DT[, ID := seq_len(.N)]


## Populate the "Data Base" from Starting DT
setkey(DT,ID)
DB[DT, c("InUse","col1","col2") := .(i.InUse,i.col1, i.col2)]


l3 <- function(x, action){
  if (action == 'update') x[, col2 := col1 + 1L]
  if (action == 'append') {
    ## Define rows to be appended
    NR <- data.table(InUse = TRUE,
                     col1 = sample.int(10L,sample.int(n = 3L, size = 1L)),
                     col2 = as.integer(NA))

    ## Allocate unused ID's for new rows
    NR[,ID := seq_len(.N) + x[InUse == TRUE,ID[.N]]]
    
    ## Allocate values for new rows by joining on ID
    x[NR, c("InUse","col1","col2") := .(i.InUse,i.col1, i.col2), on = .(ID)]
    
    }
  
}


set.seed(1L)

DB[InUse == TRUE]
#    ID InUse col1 col2
# 1:  1  TRUE    1   NA
# 2:  2  TRUE    2   NA
# 3:  3  TRUE    3   NA
l3(DB, "update")

DB[InUse == TRUE]
#    ID InUse col1 col2
# 1:  1  TRUE    1    2
# 2:  2  TRUE    2    3
# 3:  3  TRUE    3    4

l3(DB,"append")
l3(DB, "update")
DB[InUse == TRUE]
#    ID InUse col1 col2
# 1:  1  TRUE    1    2
# 2:  2  TRUE    2    3
# 3:  3  TRUE    3    4
# 4:  4  TRUE    4    5

l3(DB,"append")
l3(DB, "update")
DB[InUse == TRUE]
#    ID InUse col1 col2
# 1:  1  TRUE    1    2
# 2:  2  TRUE    2    3
# 3:  3  TRUE    3    4
# 4:  4  TRUE    4    5
# 5:  5  TRUE    1    2
# 6:  6  TRUE    2    3
# 7:  7  TRUE    5    6

This actually is reasonably performant, with generating the new rows to be added the slowest part of the process. In the example below, I simplified the function to be an "insert" only case using data.table::set for performance, with two arguments - DB for the "database" and table new with rows to be inserted.

insert <- function(x, new){
  
  ## Allocate unused ID's for new rows
  set(new, j = "InUse", value = TRUE)
  set(new, j = "ID", value = seq_len(nrow(new)) + x[InUse == FALSE,min(ID)] - 1L)
  setkey(new,ID)
  
  ## Allocate values for new rows
  x[NR, c("InUse","col","col2") := .(i.InUse,i.col, i.col2), on = .(ID)]
  
}

## Create a "Data Base" table with an over-allocated number of rows
DB <- data.table(ID = seq_len(1e5),
                 InUse = FALSE,
                 col1 = as.integer(NA),
                 col2 = as.integer(NA), key = "ID")

NR <- data.table(col = 1L,
                 col2 = 2L)

microbenchmark(insert(DB,NR),times = 1000)
# Unit: milliseconds
#           expr      min       lq     mean   median       uq      max neval
# insert(DB, NR) 4.081703 4.231624 4.645903 4.278256 4.362353 60.53741  1000

There are a lot of performance metrics out there for database "insert speed" and they all vary based on the size/shape of data to be inserted, but 4.6 milliseconds (217 inserts per second) seems fast enough for a lot of use cases I can imagine (a shiny app, perhaps).

Very helpful in well-controlled situations where the pre-allocation can be assured to be sufficient. I have seen code that 'extends' by blocks if the pre-allocation is about to be breached, but this again requires the 'deep copy'. — sch56, May 06 '21 at 19:38

Updates to data.table when passed as a function argument

2 Answers2