How to copy grouped rows into column by data.table in R?

Question

I faced memory error to copy rows into columns using gather/unite/spreadtechnique in dplyr in R in another question here (How to copy grouped rows into column by dplyr/tidyverse in R?).

It is the data frame that I am using as an example: (Sorry, most of this question just replicates previous question)

df <- data.frame(
    hid=c(1,1,1,1,2,2,2,2,2,3,3,3,3),
    mid=c(1,2,3,4,1,2,3,4,5,1,2,3,4),
    tmid=c("010","01010","010","01020",
           "010","0120","010","010","020",
           "010","01010","010","01020"),
    thid=c("010","02020","010","02020",
           "000","0120","010","010","010",
           "010","02020","010","02020")
    )

My desired output is show below:

     hid   mid  tmid   thid  tmid_1  tmid_2  tmid_3  tmid_4  tmid_5  thid_1  thid_2  thid_3  thid_4  thid_5
 * <dbl> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> 
 1     1     1   010    010    010  01010    010  01020      0    010  02020    010  02020      0
 2     1     2 01010  02020    010  01010    010  01020      0    010  02020    010  02020      0
 3     1     3   010    010    010  01010    010  01020      0    010  02020    010  02020      0
 4     1     4 01020  02020    010  01010    010  01020      0    010  02020    010  02020      0
 5     2     1   010    000    010  0120     010    010    020    000   0120    010    010    010
 6     2     2  0120   0120    010  0120     010    010    020    000   0120    010    010    010
 7     2     3   010    010    010  0120     010    010    020    000   0120    010    010    010
 8     2     4   010    010    010  0120     010    010    020    000   0120    010    010    010
 9     2     5   020    010    010  0120     010    010    020    000   0120    010    010    010
10     3     1   010    010    010  01010    010  01020      0    010  02020    010   02020     0
11     3     2 01010  02020    010  01010    010  01020      0    010  02020    010   02020     0
12     3     3   010    010    010  01010    010  01020      0    010  02020    010   02020     0
13     3     4 01020  02020    010  01010    010  01020      0    010  02020    010   02020     0

An image of this operation is shown below:

What I am trying to do in this operation are:

Converting thid and tmid into column
Suffix number in thid_x and tmid_x is defined by mid; however, maximum number of mid is not scalable (it spreads from 1 to 8-10 in actual large data set)
mid is grouped by hidto define how many mids are stored in each hid
If value does not exist, it should be padded by 0 (i.e., some hid have 5 mids but some have only 4, thus tmid_5 should be 0 for such hid)

However, when I do this operation using gather/unite/spread technique in the previous question, it encounters an memory error saying cannot allocate 11.4GB of memory.

Perhaps the reason of this error is that gather function needs to create all combinations specified in its argument before splitting them up. Actual data frame has around 80,000 records which exceeds 16GB RAM in my 64-bit version of R.

Do you have any suggestions to get the same outcome without making such huge intermediate records? Perhaps data.table may help if it does not require such intermediate operation, however I used to use dplyr and never used that package. I would like to have your idea to beyond this issue and would learn new package based on the need of analyses for further steps.

Re RAM costs, seems like avoiding the operation would be the best way to go, as Moody Mudskipper suggested in an earlier comment. (The operation just duplicates your data n times into hard-to-use column names.) — Frank, Jul 02 '18 at 18:20
Yes, this operation seems a bit quick & dirty solution. I will consider other ways for more effective ways. — HSJ, Jul 03 '18 at 02:20

score 1 · Accepted Answer · answered Jul 02 '18 at 18:14

I think you can use a combination of spread and left_join to get what you need:

library(dplyr)
library(tidyr)

a <- select(df, -thid) %>%
  spread(mid, tmid, sep="_") %>%
  rename_at(vars(matches("^mid_")), funs(paste0("t", .)))
b <- select(df, -tmid) %>%
  spread(mid, thid, sep="_") %>%
  rename_at(vars(matches("^mid_")), funs(gsub("^m", "th", .)))

left_join(df, a, by="hid") %>%
  left_join(b, by="hid")
#    hid mid  tmid  thid tmid_1 tmid_2 tmid_3 tmid_4 tmid_5 thid_1 thid_2 thid_3 thid_4 thid_5
# 1    1   1   010   010    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 2    1   2 01010 02020    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 3    1   3   010   010    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 4    1   4 01020 02020    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 5    2   1   010   000    010   0120    010    010    020    000   0120    010    010    010
# 6    2   2  0120  0120    010   0120    010    010    020    000   0120    010    010    010
# 7    2   3   010   010    010   0120    010    010    020    000   0120    010    010    010
# 8    2   4   010   010    010   0120    010    010    020    000   0120    010    010    010
# 9    2   5   020   010    010   0120    010    010    020    000   0120    010    010    010
# 10   3   1   010   010    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 11   3   2 01010 02020    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 12   3   3   010   010    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>
# 13   3   4 01020 02020    010  01010    010  01020   <NA>    010  02020    010  02020   <NA>

Cleaning up the NA values should be easy enough, but may require you to re-factor them (add a level of "0") or just create the frame using stringsAsFactors=FALSE.

Thank you for your suggestion. It worked out. I recognized that there are lots of ways to obtain the same result. — HSJ, Jul 03 '18 at 02:17

How to copy grouped rows into column by data.table in R?

1 Answers1