8

I have two data.tables with many fields.

I want to join the two tables, add some calculated fields and append all other fields from the first, second or both tables (similar to SQL's select a+b AS sum, DT1.*, DT2.* FROM...) without typing all the field names.

How can I do this (regarding easiest syntax and best performance)?

Simplified example data:

library(data.table)
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)

Now I want to join the tables and add a calculated field:

DT1[DT2, .(sum=a + b, <<< how to say DT1.*, DT2.* here? >>> ), on="x"]

Update May 4, 2016: Inspired by user jangorecki I have found a feature request for this:

Should be able to refer to i's .SD during a join

R Yoda
  • 8,358
  • 2
  • 50
  • 87
  • I'm not sure this is quite what you need, but take a look at the .SDcols argument in data table. May be helpful. – giraffehere May 03 '16 at 14:55
  • @giraffehere I think .SDcols requires me to specify each field name what I want to avoid. I am trying to find a solution with .SD but this does not work (syntax error and no way to differentiate between DT1 and DT2) – R Yoda May 03 '16 at 15:01
  • 1
    Maybe show your desired output too? Your question is a bit confusing to me. – David Arenburg May 03 '16 at 15:19
  • @David Arenburg You are right I have put to much into my question ("variations"). I will reduce this and accept one answer that works. Then I post another question for the variation. – R Yoda May 03 '16 at 15:29

3 Answers3

10

This should precisely answer your need.
It uses very powerful R feature called computing on the language (or meta programming) well described in official R Language Definition manual. This is an exceptional feature of R language and should not be forgotten IMO.

library(data.table)
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)

jj = as.call(c(
    list(as.name(".")),
    list(sum = quote(a+b)),
    lapply(unique(c(names(DT1), names(DT2))), as.name)
))
print(jj)
#.(sum = a + b, x, a, b)
DT1[DT2, eval(jj), on="x"]
#   sum x  a b
#1:  NA d NA 6
#2:   8 c  1 7
#3:  11 b  3 8
#4:  13 b  5 8
jangorecki
  • 16,384
  • 4
  • 79
  • 160
  • 3
    Aside from the educational value, not a fan of the extra complications here. @RYoda if you're ok typing `names(DT*)`, then might as well do: `DT1[DT2, c(list(sum = a+b), mget(union(names(DT1), names(DT2)))), on = 'x']`. There is no support for `i.SD` at the moment, but there is an FR about it. – eddi May 03 '16 at 18:09
  • 2
    worth to note that `x.col` and `i.col` both works in 1.9.7, so `as.name` can populate names names with prefixes, giving full control over columns used – jangorecki May 03 '16 at 18:29
  • @eddie We need splice abilities in bquote. Then this method would actually look clean. e.g. bquote(list(sum=a+b, .@(unique(...)))). Maybe one day I'll take a shot at it (and probably fail) – Clayton Stanley May 04 '16 at 00:44
5

I'm more certain of my answer to the second part of your question, so I'll answer that first. If you only want to say DT1.* or DT2.*, but want the additional column new = a+b, I would do it this way:

DT1[DT2,new:=a+b,on="x"]

For the first part, where you need DT1.* and DT2.*, the only answer I can think of is:

DT1[DT2, on="x"][,new := a+b]

However, there might be more efficient code to achieve this.

shrgm
  • 1,315
  • 1
  • 10
  • 20
  • Your first code snippet is perfect for DT1.*. I think DT2.* does not work this way because `:=` adds the column to DT1. Regarding the second code snippet: Works perfect for DT1.* AND DT2.*, but I did not explicitly mention in my question that I want to avoid adding a new column by reference (I only showed a calculated column therefore in the question) since I do not want to "polute" the data tables with temporary fields. I will clarify my question. – R Yoda May 03 '16 at 15:14
  • **Correction:** I just figured out that `:=` used in the second or higher `[...]` chain does NOT add the field to the original data table so it exactly what I wanted! – R Yoda May 03 '16 at 15:19
  • 1
    I decided to focus my question on the main part (all columns from both tables) and removed the variations (all columns from only one table). Your code snippet 2 is the right solution. – R Yoda May 03 '16 at 15:31
  • Please forgive retagging the "answer accepted" to jangorecki's answer but that is exactly what I needed – R Yoda May 03 '16 at 16:20
0

You can keep only the columns in DT2 that you need:

DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5, d=rnorm(5))
DT2 = data.table(x=c("d", "c", "b"), b=6:8, c=letters[3])

DT3 <- DT1[DT2[,.(x,b), on="x"][, sum := a+b]
David F
  • 1,506
  • 1
  • 12
  • 14