2

I have a large table (~100M row and 28 columns) in the below format:

ID  A   B   C
1   2   0   1
2   0   1   0
3   0   1   2
4   1   0   0

Columns besides ID (which is unique) gives the counts for each type (i.e. A,B,C). I would like to convert this to the below long form.

ID  Type
 1   A
 1   A
 1   C
 2   B
 3   B
 3   C
 3   C
 4   A

I also would like to use data table (rather than data frame) given the size of my data set. I checked reshape2 package in R regarding converting between long and short form however I am not clear if melt function would allow me to have counts in the short form as above.

Any suggestions on how I can convert this in a fast and efficient way in R using reshape2 and/or data.table?

KTY
  • 709
  • 1
  • 9
  • 17
  • http://stackoverflow.com/questions/22822922/creating-a-table-with-individual-trials-from-a-frequency-table-in-r-inverse-of – user295691 Sep 08 '15 at 17:48
  • Have you checked the [reshaping data.tables vignette](https://github.com/Rdatatable/data.table/wiki/Getting-started)? – Arun Sep 08 '15 at 18:21
  • @Arun the vignette doesn't seem to have an example for the case where there are frequencies in the table instead.. – KTY Sep 08 '15 at 19:15
  • @KTY, did you try the solution I shared in my answer? – A5C1D2H2I1M1N2O1R2T1 Sep 09 '15 at 05:53
  • @AnandaMahto thanks for your reply. I used the `melt` as you suggested since it is faster with data tables. For the second step, instead of using `expandRows`, I separated the rows with more than one count and then `replicated` the ID and Type by count for only those rows, and then `merged` it with the original table. Seemed to work fast. – KTY Sep 09 '15 at 22:01

1 Answers1

5

Update

You can try the following:

DT[, rep(names(.SD), .SD), by = ID]
#    ID V1
# 1:  1  A
# 2:  1  A
# 3:  1  C
# 4:  2  B
# 5:  3  B
# 6:  3  C
# 7:  3  C
# 8:  4  A

Keeps the order you want too...


You can try the following. I've never used expandRows on what would become ~ 300 million rows, but it's basically rep, so it shouldn't be slow.

This uses melt + expandRows from my "splitstackshape" package. It works with data.frames or data.tables, so you might as well use data.table for the faster melting....

library(reshape2)
library(splitstackshape)
expandRows(melt(mydf, id.vars = "ID"), "value")
# The following rows have been dropped from the input: 
# 
# 2, 3, 5, 8, 10, 12
# 
#      ID variable
# 1     1        A
# 1.1   1        A
# 4     4        A
# 6     2        B
# 7     3        B
# 9     1        C
# 11    3        C
# 11.1  3        C
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485