I'm working with JSON data that gets parsed (using jsonlite::fromJSON) into a nested data.frame, which I then recursively convert to a data.table using setDT. The issue is that to "explode along" any column of nested data.table elements (e.g., dt[, nested_dt[[1]], by=.(a, b, c)], see the accepted answer here), it is necessary to (1) ensure all nested data.tables have the same columns and (2) make sure those columns have the same class.
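As a minimal sketch of that "explode along" pattern, using hypothetical toy data (the names `nested` and `flat` are made up for illustration, and both nested tables already share the same columns and classes):

```r
library(data.table)

# Hypothetical toy data: one nested data.table per parent row,
# with identical column names and classes in every nested table.
nested <- data.table(
  a = c(1, 2),
  b = list(
    data.table(d = c("a", "b"), e = c(100, 200)),
    data.table(d = c("c", "d"), e = c(300, 400))
  )
)

# Each group holds exactly one nested table, so b[[1]] returns it and
# data.table stacks the per-group results under the grouping column.
flat <- nested[, b[[1]], by = .(a)]
flat
```

If the nested tables differed in column names or classes, this grouped call is where the mismatch would surface, which is why the columns need to be aligned first.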

The trouble is that there appears to be some issue with R (or perhaps data.table, I'm not sure) triggering a shallow copy when a new column is added to a nested data.table.

I'd like to do something like this (with actual logic around the added column name and type):

add_col1 <- function(dt) {
  if (is.data.table(dt)) 
    dt[, new_col:=NA]

  if (is.list(dt)) 
    lapply(dt, add_col1)

  return(invisible())
}

However, testing yields:

dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
#    a            b 
# 1: 1 <data.table>     
# 2: 2 <data.table> 

add_col1(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
#    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table 
#    so that := can add this new column by reference. At an earlier point, this data.table 
#    has been copied by R (or been created manually using structure() or similar). Avoid 
#    key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. 
#    Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, 
#    list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please 
#    upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to 
#    datatable-help so the root cause can be fixed.

dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA

dt[, b]
# [[1]]
#    d   e
# 1: a 100
# 2: b 200
# 
# [[2]]
#    d   e
# 1: a 100
# 2: b 200

So I triggered a bad copy and didn't get the desired result: new_col was added to the top-level data.table, which is good, but not to the nested data.tables, which is bad. Since I think the issue is that lapply isn't assigning back to the original parent data.table, I tried:

add_col2 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]

    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id])
      dt[, c(col):=add_col2(get(col))]
  } else if (is.list(dt)) 
    return(invisible(lapply(dt, add_col2)))

  return(invisible(dt))
}

As shown below, this generates the desired output, but I do not avoid the shallow copy (or the warning message that comes with it).

dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
#    a            b 
# 1: 1 <data.table>     
# 2: 2 <data.table> 

add_col2(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
#    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table 
#    so that := can add this new column by reference. At an earlier point, this data.table 
#    has been copied by R (or been created manually using structure() or similar). Avoid 
#    key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. 
#    Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, 
#    list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please 
#    upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to 
#    datatable-help so the root cause can be fixed.

dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA

dt[, b]
# [[1]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
# 
# [[2]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA

Is there a right way to do this? I can suppress the warning and go with the add_col2 pattern above, but if there is a way to modify the nested data in place without taking a copy, that would be great. I am also aware of the possibility of using rbindlist with fill=TRUE; however, since my use case involves a by= argument, I'd rather avoid that approach.
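For comparison, here is a minimal sketch of a variant that trades the in-place goal for an explicit copy. It assumes that copy() hands back a data.table with a valid self-reference, so := on the copy should not need the shallow-copy fix. This is not the no-copy solution asked for, only an illustration of assigning modified nested tables back to the parent:

```r
library(data.table)

dt <- data.table(a = c(1, 2),
                 b = list(data.table(d = c("a", "b"), e = c(100, 200))))

# Deep-copy each nested table, add the column to the copy by reference,
# then assign the list of modified copies back to the parent column.
# Wrapping the result in list() keeps b a list column for any row count.
dt[, b := list(lapply(b, function(x) copy(x)[, new_col := NA]))]

dt$b[[1]]
```

The cost is one full copy of every nested table, so for large nested data this may not be acceptable.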

These questions were helpful for understanding but didn't solve my issue:
Adding new columns to a data.table by-reference within a function not always working
Using setDT inside a function

EDIT ------------------------

Avoiding lapply doesn't seem to help. The following yields exactly the same results as add_col2.

add_col3 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]
    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id]) {
      for (i in seq(1, dt[, .N]))
        dt[i, c(col):=.(list(add_col3(get(col)[[1]])))]
    }
  } else if (is.list(dt)) 
    stop("should not reach this now")

  return(invisible(dt))
}

EDIT 2 -------------------------

Per Eddi's comment below, I get the desired result with add_col1 by adding a setDF/setDT step like so:

dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))

# here is the addition
lapply(dt$b, setDF)
lapply(dt$b, setDT)

dt
#    a            b
# 1: 1 <data.table>
# 2: 2 <data.table>

add_col1(dt)
dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA

dt[, b]
# [[1]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
# 
# [[2]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA

I do not understand why this step worked, though. It does not appear to be because the original dt was formed by recycling the nested data.table. I got the same results using:

dt <- data.table(a=c("abc", "def", "ghi"))
ndt1 <- data.table(d=c(1.2, 1.4), e=c("a1", "b1"))
ndt2 <- data.table(d=c(1L, 2L), e=c("a2", "b2"), f=c(1, 2))
ndt3 <- data.table(d=c(1.6, 3.4), e=c("a3", "b3"))
dt[, b:=c(list(ndt1),
          list(ndt2),
          list(ndt3))]
Matt Pollock
  • your first function isn't working for the same reason `add_col1(dt[, b])` doesn't add a `NA` column to each table in column b - you can't lapply a `:=` call – Chris Apr 05 '16 at 19:05
  • A data.table is also a list. Not sure if you want to write the first fun that way. – Frank Apr 05 '16 at 19:05
  • @Chris, understood w.r.t. `lapply` not working for `:=`. I was able to recursively `setDT` using `lapply` which was nice. In `add_col1` above the nested `data.table`s do have a column added to them (stepping through in debug mode) but those modified `data.table`s are what is triggering the shallow copy, so they don't assign back to the parent – Matt Pollock Apr 05 '16 at 19:10
  • I'd blame lapply and switch to a loop in this case. – Frank Apr 05 '16 at 19:10
  • Somehow the issue is in your original `data.table`, but I'm not sure what it is about your construction that screws it up. Your function works after doing `lapply(dt$b, setDF); lapply(dt$b, setDT)`. – eddi Apr 05 '16 at 19:21
  • @Frank is the above edit what you meant? I don't think `lapply` is causing any issues that `for ...` doesn't also cause. – Matt Pollock Apr 05 '16 at 19:24
  • Oh, I guess you're right about that. I've run into some weird things with lapply and so assumed it was the culprit. – Frank Apr 05 '16 at 19:26
  • To eddi's point, it also works if you construct your example like `dt <- data.table(a=c(1,2))[,\`:=\`(b = list(data.table(d=c("a", "b"), e = c(100, 200))))]` Maybe nested calls to data.table() don't work right? – Frank Apr 05 '16 at 19:30
  • @eddi that is really odd. Adding your line I get the desired results with `add_col1` above. I don't understand what going back and forth between `data.table` and `data.frame` did though. – Matt Pollock Apr 05 '16 at 19:31
  • I've been using v1.9.7 all along - just updated to latest version from github and got identical results. – Matt Pollock Apr 05 '16 at 19:36
  • You could post an issue on the tracker if you think it needs fixing. This seems to be enough to reproduce it `DT = data.table(d = list(data.table(a=1))); DT$d[[1]][, new_col := NA]` – Frank Apr 05 '16 at 19:45
  • I added an issue here: https://github.com/Rdatatable/data.table/issues/1629. Thanks for the concise example. – Matt Pollock Apr 05 '16 at 19:51
  • Since a `data.table` is also a `list`, both conditions will return TRUE. The operation under `data.table` is valid, while the operation under `list` is not valid for a `data.table` and throws the warning. You can use the same solution, but you can differentiate data.table and list with `"data.table" %in% class(dt)` – TheRimalaya Apr 05 '16 at 20:00
  • When `x` is a `data.table` I want both `if` expressions in `add_col1` to evaluate to `TRUE`. That controls the recursion. If you debug it you'll see that `lapply` on a `data.table` re-calls the function for each column, at which point (on the 2nd call) I have a list of `data.table` elements that I again call `lapply` on to get into each nested `data.table`. – Matt Pollock Apr 05 '16 at 20:05
  • @MattPollock what going back and forth does, is it fixes whatever is wrong (in place) since you start from a clean slate of `data.frame`, but I'm not sure what exactly is wrong to begin with – eddi Apr 05 '16 at 20:33
  • Thanks. It appears that somewhere (maybe when wrapped by `list`?) the memory address changes. `address(ndt1) != address(dt[1, b[[1]]])` in the edit 2 above – Matt Pollock Apr 05 '16 at 20:36

0 Answers