I have two massive
datasets. I created a simple example (below) to replicate the structure of my data: DT_ade
has variables a
, d
, e
, while DT_abce
has variables a
, b
, c
, e
.
I would like to compute the average of d
by a-b-c
. In the code below, I merged the two DT
based on values of a-e
. Because I am not interested in variable e
, I just selected the remaining variables after the join (a
, b
, c
, d
). I then computed the average of d
by a-b-c
.
library(data.table)
DT_ade = data.table(a = c("a1", "a1", "a2", "a3", "a3"),
d = c(10, 20, 30, 40, 80) ,
e = c("e1", "e2", "e2", "e2", "e3"))
DT_abce = data.table(a = c("a1", "a1", "a1", "a1", "a2", "a3", "a3"),
b = c("b1", "b1", "b2", "b2", "b2", "b2", "b2"),
c = c("c1", "c1", "c2", "c2", "c2", "c2", "c2"),
e = c("e1", "e2", "e1", "e2", "e2", "e2", "e3"))
DT_ade[
DT_abce, on=.(a, e), .(a, b, c, d)
][, .(mean_d = mean(d, na.rm=TRUE)),
.(a, b, c)]
The code works in this simple example: it gives the following output:
a b c mean_d
1: a1 b1 c1 15
2: a1 b2 c2 15
3: a2 b2 c2 30
4: a3 b2 c2 60
However, when applying this code to my real data, I get the following error message due to the size of my data:
Error in gforce(thisEnv, jsub, o__, f__, len__, irows) :
Internal error: Failed to allocate counts or TMP when assigning g in gforce
I am wondering if there was any less demanding way to do this. For instance, could I compute the mean d
by a-b-c
while not generating the full list a-b-c-d
of as a previous step?