
I have a dataframe with the following columns:

id condition trial   a0   a1 ... a7   b1   b2 ... b7
 1      1      1     23   56 ... 52   87   56 ... 98
 1      1      2     57   87 ... 87   55   52 ... 52
 1      1      3     23   56 ... 52   87   56 ... 98
 2      1      1     23   56 ... 82   97   56 ... 68

a and b are values measured at times 0, 1, ..., 7. Measure a has a value at time 0, but b does not.

I want to have it in the format:

id condition trial time measureA  measureB
 1     1      1     0       23      NA
 1     1      2     0       57      NA
 1     1      3     0       23      NA
 ...
 1     1      1     1       56      87
 1     1      2     1       87      55 
 ...

In other words, I have a time series in seconds (times 0-7) for two measures, spread across the a0-a7 and b1-b7 columns, and I want to "collect" each measure into a single column. One of the measures (a) also has a value at time 0.

This is what I have tried so far:

I used the function `gather` from the tidyr package:

library(tidyr)

# gather each measure's time columns into a key/value pair
x = gather(x, timeA, measureA, -c(1:3, 4:11))
x = gather(x, timeB, measureB, -c(1:3, 10, 11))

and could put the dataframe in the following format:

id condition trial    timeA  measureA   timeB  measureB
                         0       12       1       45
                         0       52       1       85
                         1       51       2       55
                         1       45       2       92
                         2       45       3       15
                         2       52       3       54

I wanted to use merge to combine the timeA and timeB halves, but it does not work because the resulting data frame is too big.

I used the code:

merge(x[1:5], x[6:7], by.x="timeA", by.y = "timeB")

I get the error message

Error: cannot allocate vector of size 40980.3 Gb

Is there any way to achieve what I want to do?

  • Please show the code you tried for merging. You probably tried an unintended cartesian join. – Roland May 24 '16 at 12:36
  • I edited the question including the code – Sininho May 24 '16 at 12:39
  • Your small example already shows that your `timeA` and `timeB` values are not unique. Thus, you get a cartesian join, which explains the huge memory demand. I suspect that you do not want this join, but the expected result is unclear. – Roland May 24 '16 at 12:41
  • They are indeed not unique. There is one value of timeA and timeB for each combination of the other variables (id, condition, trial). – Sininho May 24 '16 at 12:44
  • Provide a small reproducible example of what you have and the result you need and it would be easy to help you. As it is, composing a good answer involves too much guesswork. – Roland May 24 '16 at 12:46
  • I tried editing the question with a more complete view of what I have and what I want. I hope it makes it more clear. – Sininho May 24 '16 at 12:58
  • No, that isn't sufficient. Please read this FAQ: http://stackoverflow.com/a/5963610/1412059 We need something that lets us recreate your problem and shows the corresponding exact expected result. – Roland May 24 '16 at 13:02

1 Answer


I guess you could try using sqldf or data.table. Could you provide more info in your question: sample data, and what kind of merge you are trying (a left join)?

library(sqldf)
library(data.table)

using sqldf (df1, df2, and the key column x are placeholders):

# left outer join on the shared column x
res_sql = sqldf("SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.x = df2.x")
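
The LEFT OUTER JOIN keeps every row of df1 and fills the columns from df2 with NA wherever there is no match on x.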

using data.table (dt1 and dt2 are placeholder data.tables):

# inner join (nomatch = 0L drops non-matches); allow.cartesian permits many-to-many matches
res_dt = dt1[dt2, nomatch = 0L, on = "x", allow.cartesian = TRUE]
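
As Arun points out in the comments below, the question itself is probably best served by `data.table::melt`, which can collapse both sets of time columns in one call. A minimal sketch, assuming the wide table is the `x` data frame from the question with columns named exactly a0-a7 and b1-b7; the b0 padding step is my own addition to line the two groups up:

library(data.table)
setDT(x)  # convert the wide data frame to a data.table in place

# b has no measurement at time 0, so pad it with an NA column
# (use an NA type matching the other b columns) so that a0 pairs
# with b0, a1 with b1, and so on
x[, b0 := NA_real_]

long = melt(
  x,
  id.vars       = c("id", "condition", "trial"),
  measure.vars  = list(paste0("a", 0:7), paste0("b", 0:7)),
  value.name    = c("measureA", "measureB"),
  variable.name = "time"
)

# 'time' comes out as a factor with levels 1..8; map it back to 0..7
long[, time := as.integer(time) - 1L]

This gives one row per id/condition/trial/time, with measureA and measureB side by side and NA in measureB at time 0, which is the target format in the question.
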
  • I think this Q just needs `data.table::melt.data.table`. Also the argument is `allow.cartesian`, not `cartesian`. It's better to test the code. – Arun May 24 '16 at 13:06
  • Thank you for the heads up. I'm still trying to understand what exactly is the question. – rafa.pereira May 24 '16 at 13:11
  • OP wants all `a*` columns and `b*` columns collapsed together (in long format). Since `b0` doesn't exist, that needs to be filled with `NA` (corresponding to the column `a0`). – Arun May 24 '16 at 13:13
  • Thanks @Arun. You're very generous. Please feel free to post your answer or to edit mine. :) – rafa.pereira May 24 '16 at 13:16