I'm working with JSON data that gets parsed (using jsonlite::fromJSON) to a nested data.frame, which I am then recursively setting to a data.table using setDT. The issue is that to "explode along" any column of nested data.table elements (e.g., dt[, nested_dt[[1]], by=.(a, b, c)], see the accepted answer here), it is necessary to (1) ensure all nested data.tables have the same columns and (2) make sure those columns have the same class.
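To make the "explode" step concrete, here is a minimal sketch on toy data. The names nested and flat and the values are made up for illustration; the nested tables already share column names and classes, which is the precondition described above.

```r
library(data.table)

# Hypothetical toy data: both nested tables have columns d (character)
# and e (numeric), so they can be stacked without any fixing up.
nested <- data.table(
  a = c(1, 2),
  b = list(
    data.table(d = c("a", "b"), e = c(100, 200)),
    data.table(d = c("c", "d"), e = c(300, 400))
  )
)

# Each group holds exactly one nested table, so b[[1]] extracts it and
# data.table binds the per-group results next to the grouping column.
flat <- nested[, b[[1]], by = .(a)]
print(flat)
```

If one nested table were missing a column, or had a column of a different class, this grouped call is where the mismatch would surface.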
The trouble is that there appears to be some issue with R (or perhaps data.table, I'm not sure) triggering a shallow copy when a new column is added to a nested data.table.
I'd like to do something like this (with actual logic around the added column name and type):
add_col1 <- function(dt) {
  if (is.data.table(dt))
    dt[, new_col:=NA]
  if (is.list(dt))
    lapply(dt, add_col1)
  return(invisible())
}
However, testing yields:
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
#    a            b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col1(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table
# so that := can add this new column by reference. At an earlier point, this data.table
# has been copied by R (or been created manually using structure() or similar). Avoid
# key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table.
# Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2,
# list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please
# upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to
# datatable-help so the root cause can be fixed.
dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA
dt[, b]
# [[1]]
#    d   e
# 1: a 100
# 2: b 200
#
# [[2]]
#    d   e
# 1: a 100
# 2: b 200
So I triggered a bad copy and didn't get the desired result (new_col was added to the top-level data.table, which is good, but not to the nested data.tables, which is bad). Since I think the issue is that lapply isn't assigning back to the original parent data.table, I tried:
add_col2 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]
    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id])
      dt[, c(col):=add_col2(get(col))]
  } else if (is.list(dt))
    return(invisible(lapply(dt, add_col2)))
  return(invisible(dt))
}
As shown below, this generates the desired output, but I do not avoid the shallow copy (or the warning message that comes with it).
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
#    a            b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col2(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table
# so that := can add this new column by reference. At an earlier point, this data.table
# has been copied by R (or been created manually using structure() or similar). Avoid
# key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table.
# Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2,
# list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please
# upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to
# datatable-help so the root cause can be fixed.
dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA
dt[, b]
# [[1]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
#
# [[2]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
Is there a right way to do this? I can suppress the warning and go with the add_col2 pattern above, but if there is a way to modify the nested data in place without taking a copy, that would be great. I am also aware of the possibility of using rbindlist with fill=TRUE; however, since my use case involves a by= argument, I'd rather avoid that approach.
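For reference, the rbindlist route would look roughly like the sketch below. The parts list and the row_id column name are made up for illustration; fill=TRUE pads columns that are missing from some of the nested tables with NA, and idcol records which list element each row came from.

```r
library(data.table)

# Hypothetical nested tables with different column sets.
parts <- list(
  data.table(d = c(1.2, 1.4), e = c("a1", "b1")),
  data.table(d = c(1, 2), e = c("a2", "b2"), f = c(1, 2))
)

# Stack them; column f is filled with NA for the rows of the first table.
stacked <- rbindlist(parts, fill = TRUE, idcol = "row_id")
print(stacked)
```

This gives one flat table, but the link back to the parent rows has to be rebuilt from row_id afterwards, which is the part I would like to avoid.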
These questions were helpful for understanding but didn't solve my issue:
Adding new columns to a data.table by-reference within a function not always working
Using setDT inside a function
EDIT ------------------------
Avoiding lapply doesn't seem to help. The following yields exactly the same results as add_col2.
add_col3 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]
    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id]) {
      for (i in seq(1, dt[, .N]))
        dt[i, c(col):=.(list(add_col3(get(col)[[1]])))]
    }
  } else if (is.list(dt))
    stop("should not reach this now")
  return(invisible(dt))
}
EDIT 2 -------------------------
Per Eddi's comment below, I get the desired result with add_col1 by adding a setDF/setDT step like so:
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
# here is the addition
lapply(dt$b, setDF)
lapply(dt$b, setDT)
dt
#    a            b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col1(dt)
dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA
dt[, b]
# [[1]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
#
# [[2]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
I do not understand why this step worked, though. It does not appear to be because the original dt was formed by recycling the nested data.table. I got the same results using:
dt <- data.table(a=c("abc", "def", "ghi"))
ndt1 <- data.table(d=c(1.2, 1.4), e=c("a1", "b1"))
ndt2 <- data.table(d=c(1L, 2L), e=c("a2", "b2"), f=c(1, 2))
ndt3 <- data.table(d=c(1.6, 3.4), e=c("a3", "b3"))
dt[, b:=c(list(ndt1),
          list(ndt2),
          list(ndt3))]
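One way to probe what the setDF/setDT step changes is to compare how many column slots each table has allocated. This is only a diagnostic sketch. It assumes the warning is tied to the nested tables losing their column over-allocation, which I have not confirmed. The names fresh, parent_dt, and nested_alloc are made up for illustration.

```r
library(data.table)

# A freshly created data.table is over-allocated: truelength() reports the
# number of column slots reserved, length() the number actually in use.
fresh <- data.table(d = c("a", "b"), e = c(100, 200))
print(c(length = length(fresh), truelength = truelength(fresh)))

# Same check for the tables once they sit inside a list column. Whether these
# values match the fresh table may depend on the data.table version, so no
# particular output is claimed here.
parent_dt <- data.table(a = c(1, 2), b = list(fresh))
nested_alloc <- sapply(parent_dt$b, truelength)
print(nested_alloc)
```

Running the same sapply call again after lapply(parent_dt$b, setDF) and lapply(parent_dt$b, setDT) would show whether that step is what restores the spare slots.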