1

I am loading some JSON data using jsonlite which is resulting in some nested data similar (in structure) to the toy data.table dt constructed below.

Setup:

dt <- data.table(a=c("abc", "def", "ghi"), b=runif(3))
dt[, c:=list(list(data.table(d=runif(4), e=runif(4))))]
dt
     a         b            c
1: abc 0.2623218 <data.table>
2: def 0.7092507 <data.table>
3: ghi 0.2795103 <data.table>

In my data there exists key/identifier columns (e.g., column a in the above data.table) but the nested data has no such identifier. I need to work with the nested data (see my related question) but before using rbindlist I want to push the value of column a as a constant column of the nested data.

For example, the value of c in the first row is:

dt[1, c]
[[1]]
            d         e
1: 0.27951027 0.6350479
2: 0.45996594 0.6148956
3: 0.29377315 0.8432641
4: 0.01850265 0.1346082

My objective is to modify it such that it look like the following before binding the nested tables together so that I know which rows go with which record:

dt[1, c]
[[1]]
            d         e   a
1: 0.27951027 0.6350479 abc
2: 0.45996594 0.6148956 abc
3: 0.29377315 0.8432641 abc
4: 0.01850265 0.1346082 abc

I can make this happen using a custom function that gets called within the J-expression and takes the nested data.table for that row as well as the constant column (a in this example) name and value. Something along the lines of:

add_nested_column <- function(dfl, name, value) {
  df <- dfl[[1]] # data.frame wrapped in list per row, pull it out
  df[, c(name):=value]  
  return(list(list(df)))
}

used like so:

dt[, key:=1:.N]
dt[, c:=add_nested_column(c, "a", a), by=key]
rbindlist(dt[, c])
             d         e   a
 1: 0.27951027 0.6350479 abc
 2: 0.45996594 0.6148956 abc
 3: 0.29377315 0.8432641 abc
 4: 0.01850265 0.1346082 abc
 5: 0.27951027 0.6350479 def
 6: 0.45996594 0.6148956 def
 7: 0.29377315 0.8432641 def
 8: 0.01850265 0.1346082 def
 9: 0.27951027 0.6350479 ghi
10: 0.45996594 0.6148956 ghi
11: 0.29377315 0.8432641 ghi
12: 0.01850265 0.1346082 ghi

This solution doesn't seem all that bad to me but I'm wondering if it is really the best practice or if there are features built into data.table or other techniques that I am unaware of that would make it more efficient/concise.

Community
  • 1
  • 1
Matt Pollock
  • 1,063
  • 10
  • 26
  • Your example seems to be copying column "a" to column "a" (so no change)...? – Frank Jul 08 '15 at 15:27
  • It copies column "a" from the top level table to a new column "a" in the nested table. Note that `rbindlist(dt[, c])` is binding all rows from the nested `data.table`s found in column "c". The objective is to get at the nested data while retaining enough information about where the rows came from (prior to the row bind) to be useful – Matt Pollock Jul 08 '15 at 15:33
  • Ok, I think I get the idea now. (posted an answer) Generally, it's best to use `set.seed` when generating random data so that we're all looking at the same thing. Also, I find discrete random values (like from `sample` a lot easier to parse at a glance than continuous `rnorm` values). – Frank Jul 08 '15 at 15:41
  • I added another line to the answer. There's a much nicer way to do it in data.table 1.9.5+ – Frank Jul 08 '15 at 17:36

1 Answers1

3

To achieve the end result, the most straightforward way is

dt[,c[[1]],by=a]

which gives

      a         d          e
 1: abc 0.9082078 0.66079779
 2: abc 0.2016819 0.62911404
 3: abc 0.8983897 0.06178627
 4: abc 0.9446753 0.20597457
 5: def 0.9082078 0.66079779
 6: def 0.2016819 0.62911404
 7: def 0.8983897 0.06178627
 8: def 0.9446753 0.20597457
 9: ghi 0.9082078 0.66079779
10: ghi 0.2016819 0.62911404
11: ghi 0.8983897 0.06178627
12: ghi 0.9446753 0.20597457

You can alternately do

rbindlist( setNames(dt[["c"]], dt[["a"]]), idcol = TRUE )

Data

set.seed(1)
dt <- data.table(a=c("abc", "def", "ghi"), b=runif(3))
dt[, c:=list(list(data.table(d=runif(4), e=runif(4))))]
Frank
  • 66,179
  • 8
  • 96
  • 180