I faced memory error to copy rows into columns using gather/unite/spread
technique in dplyr
in R
in another question here (How to copy grouped rows into column by dplyr/tidyverse in R?).
It is the data frame that I am using as an example: (Sorry, most of this question just replicates previous question)
df <- data.frame(
hid=c(1,1,1,1,2,2,2,2,2,3,3,3,3),
mid=c(1,2,3,4,1,2,3,4,5,1,2,3,4),
tmid=c("010","01010","010","01020",
"010","0120","010","010","020",
"010","01010","010","01020"),
thid=c("010","02020","010","02020",
"000","0120","010","010","010",
"010","02020","010","02020")
)
My desired output is show below:
hid mid tmid thid tmid_1 tmid_2 tmid_3 tmid_4 tmid_5 thid_1 thid_2 thid_3 thid_4 thid_5
* <dbl> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr>
1 1 1 010 010 010 01010 010 01020 0 010 02020 010 02020 0
2 1 2 01010 02020 010 01010 010 01020 0 010 02020 010 02020 0
3 1 3 010 010 010 01010 010 01020 0 010 02020 010 02020 0
4 1 4 01020 02020 010 01010 010 01020 0 010 02020 010 02020 0
5 2 1 010 000 010 0120 010 010 020 000 0120 010 010 010
6 2 2 0120 0120 010 0120 010 010 020 000 0120 010 010 010
7 2 3 010 010 010 0120 010 010 020 000 0120 010 010 010
8 2 4 010 010 010 0120 010 010 020 000 0120 010 010 010
9 2 5 020 010 010 0120 010 010 020 000 0120 010 010 010
10 3 1 010 010 010 01010 010 01020 0 010 02020 010 02020 0
11 3 2 01010 02020 010 01010 010 01020 0 010 02020 010 02020 0
12 3 3 010 010 010 01010 010 01020 0 010 02020 010 02020 0
13 3 4 01020 02020 010 01010 010 01020 0 010 02020 010 02020 0
An image of this operation is shown below:
What I am trying to do in this operation are:
- Converting
thid
andtmid
into column - Suffix number in
thid_x
andtmid_x
is defined bymid
; however, maximum number ofmid
is not scalable (it spreads from 1 to 8-10 in actual large data set) mid
is grouped byhid
to define how manymid
s are stored in eachhid
- If value does not exist, it should be padded by
0
(i.e., somehid
have 5mid
s but some have only 4, thus tmid_5 should be 0 for suchhid
)
However, when I do this operation using gather/unite/spread
technique in the previous question, it encounters an memory error saying cannot allocate 11.4GB of memory.
Perhaps the reason of this error is that gather
function needs to create all combinations specified in its argument before splitting them up. Actual data frame has around 80,000 records which exceeds 16GB RAM in my 64-bit version of R
.
Do you have any suggestions to get the same outcome without making such huge intermediate records? Perhaps data.table
may help if it does not require such intermediate operation, however I used to use dplyr
and never used that package.
I would like to have your idea to beyond this issue and would learn new package based on the need of analyses for further steps.