Combining "by" and "on" to join and create summary column for data.table

Question

I have two datasets, one a detailed dataset of weight's and another that's supposed to be a summary dataset. I am trying to create the summary dataset by joining the detail dataset and aggregating, but it isn't working as expected.

Here's a sample code.

mytesta <- data.table(cola = c("a","b"), groupa = c(1,2))  # summary
mytestb <- data.table(groupa = c(1,1,1,1,2,2,2), weighta = c(10,20,30,25,15,30,10))  #detail

And this is my desired output.

   cola groupa weighta
1:    a      1      85
2:    b      2      55

What I tried to do is,

mytesta[mytestb, on = "groupa", weight_summary := sum(i.weighta), by = "groupa"]

The problem is that when by is used, the columns of the inner data.table disappear (for instance, mytesta[mytestb, on = "groupa", .SD, by = "groupa"]). Is there a way around this?

In a chained statement: `mytesta[mytestb, on = "groupa", nomatch = 0][, .(weight_summary = sum(weighta)), by = .(cola, groupa)]` ? — SymbolixAU, Mar 13 '17 at 22:47
Thank you, this kind of works, but not fully. My actual summary dataset has quite a few more columns. Is there a way, ideally, to update by reference? Or, next best thing, a solution where instead of `by = .(cola, groupa)`, an expression that excludes column `weighta` is written? — Naumz, Mar 13 '17 at 23:39
you can pass a character vector into `by`, so something like : `by = c(names(mytesta))]` will work — SymbolixAU, Mar 13 '17 at 23:43
It solves my problem. thank you! If you'd like to post it as an answer, please go ahead. — Naumz, Mar 13 '17 at 23:50
I'll leave it for the moment because I have a feeling there is a better way of doing it; and there are a lot more knowledgable people on `data.table` than me that could probably solve it better :) — SymbolixAU, Mar 13 '17 at 23:56
I always do `mytesta[, v := mytestb[.SD, on=.(groupa), sum(weighta), by=.EACHI]$V1 ]`. I'm guessing there's a dupe around somewhere for this but am not going to look right now. — Frank, Mar 14 '17 at 00:36
@Frank that's what I was looking for. Would you like to post the answer? — Naumz, Mar 14 '17 at 17:36

score 2 · Accepted Answer · answered Mar 14 '17 at 17:42

2

I would do

mytesta[, v := mytestb[.SD, on=.(groupa), sum(weighta), by=.EACHI]$V1 ]

In a X[Y] join, we're looking up each row of Y in X.

So if the end goal is to create a new column in Y computed per row, we'll need a join Y[, v := X[Y, ...]] even though Y[X, v := ...] might seem more intuitive at first.

answered Mar 14 '17 at 17:42

Frank

66,179
8
96
180

That makes sense, thanks! I'm still a bit confused about the seemingly more intuitive approach. If `Y[X]` looks up each row of `X` in `Y`, why are the columns of `X` not available in `by`? They're neither available as `by="weighta"` nor as `j="weighta", by="groupa"`. I thought it's because the original data.table `Y` is not actually modified by the join operation, but the copy does increase the number of rows in Y so there is _some_ change as seen by, `mytesta[mytestb, on="groupa", .N, by="groupa"]` (this code seems buggy as the entries in `groupa` are wrong, so perhaps I'm missing something). – Naumz Mar 14 '17 at 18:31
1

@Naumz Ah, that is a good question. I guess it gets complicated when you consider that any modifications in `j` of `Y[X, j, by]` pertain to `Y` -- it must be unambiguous what it means to "group by" columns of `X` and `Y` and even computed columns like `xcol + ycol %% 2`. For the time being, only `by=.EACHI` works. If you don't care about doing something in `Y`, then `Y[X][, j, by]` always works (creating a new table). There is a FR for extending how `by` works here, but I think it's a tough problem. – Frank Mar 14 '17 at 18:37
And regarding the "bugginess", this is not a bug. The values from `X` in `Y[X, on=col]` are the ones that appear in the resulting table. I agree that it's kind of confusing, but there's a reason for it. There's also a FR or two regarding extending it. – Frank Mar 14 '17 at 18:39
I understand your first comment regarding the functionality of `by`. Thanks for clarifying that it's something not implemented! About the 'bugginess', I understand which values should appear, but the output is not so, I think. For instance, if I do `mytesta[mytestb, on="groupa", .SD, by="groupa"]`, I'd expect both values of `groupa` (1 and 2, both of which appear in `mytestb`) to appear in the output. However, the output col of `groupa` comprises of all 1's and no 2's. I think staying away from using `by` with column names when joining is a safe solution. – Naumz Mar 14 '17 at 19:28
Note that in the above "bug", the output of `cola` is as expected. It's just `groupa` that is messed up. – Naumz Mar 14 '17 at 19:30
1

@Naumz Yeah, I think `by=groupa` is not supported usage and should probably return an error. `mytesta[mytestb, on="groupa", .SD, by=.EACHI]` gives the results you mention expecting. – Frank Mar 14 '17 at 19:35
1

Yes, that's why I ended up doing elsewhere based on this answer: http://stackoverflow.com/questions/27004002/eachi-in-data-table/27004566#27004566 – Naumz Mar 14 '17 at 19:38

score 1 · Answer 2 · answered Mar 14 '17 at 00:22

Here is a solution where I first merge your two data.tables, then summarize.

tab = merge(mytesta, mytestb, by="groupa")
tab
#    groupa cola weighta
# 1:      1    a      10
# 2:      1    a      20
# 3:      1    a      30
# 4:      1    a      25
# 5:      2    b      15
# 6:      2    b      30
# 7:      2    b      10

res = tab[, list(weighta=sum(weighta)), by=list(cola, groupa)]
res
#    cola groupa weighta
# 1:    a      1      85
# 2:    b      2      55

Combining "by" and "on" to join and create summary column for data.table

2 Answers2