4

I have a data.table with a list column, where each element is a data.table:

dt <- data.table(id = c(1, 1, 2),
                 var = list(data.table(a = c(1, 2), b = c(3, 4)),
                            data.table(a = c(5, 6), b = c(7, 8)),
                            data.table(a = 9, b = 10)))

dt
# id             var
# 1:  1 <data.table>
# 2:  1 <data.table>
# 3:  2 <data.table>

Now I want to "unlist" this structure to:

   a  b id
1: 1  3  1
2: 2  4  1
3: 5  7  1
4: 6  8  1
5: 9 10  2

I know how to expand the embedded data.table part with rbindlist, but just have no idea how to bind the flattened data.table with variable "id".

The original dataset is 30 million lines and with dozens of variables, so I would really appreciate if you would propose solution not only workable but also memory efficient.

Henrik
  • 65,555
  • 14
  • 143
  • 159
R. Zhu
  • 415
  • 4
  • 16
  • @Frank Thank you for your comment but I have to say this solution only works for situation where id is unique. If the id is duplicated then var[[1]] only retrieve the first row of each id. I've modified my example to incorporate this situation. – R. Zhu Apr 01 '17 at 04:42
  • Ok, my answer (below) already handles that, I guess. – Frank Apr 01 '17 at 04:46
  • Related: [Unlist data frame column preserving information from other column](http://stackoverflow.com/questions/26194298/unlist-data-frame-column-preserving-information-from-other-column) – Henrik Apr 01 '17 at 20:13

1 Answers1

6

In this case dt[, var[[1]], by=id] works. However, I use rbindlist as the OP mentioned:

dt[, r := as.character(.I) ]
res <- dt[, rbindlist(setNames(var, r), id="r")]

Then merge on r (rows of dt) if you really need any vars from there:

res[dt, on=.(r), `:=`(id = i.id)]

This is better than dt[, var[[1]], by=id] in a few ways:

  • rbindlist should be faster than something with a lot of by= groups.
  • If there are more vars in dt, all of them will have to end up in by=.
  • Probably, it is not necessary to carry over vars from dt at all, since they can always be grabbed from that table later and they take up a lot less memory there.
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    Fwiw, here's an illustration of what I have in mind regarding "more vars" in `dt`: http://franknarf1.github.io/r-tutorial/_book/tables.html#rbindlist-read (the `fn` and `year` columns) – Frank Apr 01 '17 at 04:50
  • 1
    Thanks! I'm digging into it. – R. Zhu Apr 01 '17 at 04:52
  • Your code saved my life! It's super fast. Now I only got one question that I didn't see it necessary to add the "setNames" function in the rbindlist since I get the same result without it. Could you explain it? – R. Zhu Apr 01 '17 at 08:06
  • @R.Zhu Cool! Yes, good observation, you are right. I guess I habitually use it, since in general, one might want the linking column to be something other than the row numbers. For example, if `dt[, r := sprintf("%07f.", .I)]` (row numbers with leading zeros) or if handling files (like the linked example) and using file names to link. – Frank Apr 01 '17 at 15:11
  • 3
    @Frank Excellent R/data.table tutorial! Thanks for sharing. – Henrik Apr 01 '17 at 19:46