3

What I'm trying to achieve is similar to SQL

select *
from T left join
     T2 on T1.key2 = T2.key2
     left join T3 on T1.key3 = T3.key3

But in R using data.table. Specifically, 'append' two columns from T2 and one column from T3 to the first table.
My current solution looks like

setkey(t1,key2)
t1 <- t2[t1]
setkey(t1,key3)
t1 <- t3[t1]
setkey(t1,key1) # restore the original key

This copies the entire t1 twice and is not too quick. On the 'inside' t1 is a quite large dataset ~ 1mln rows and t2 and t3 are just small 'mapping' tables.
I think using J(..) could help, but I'm new to data.tables and can't really understand how to apply it here.
Could somebody please suggest a faster solution for the join? Thanks

Sergii Zaskaleta
  • 502
  • 1
  • 4
  • 21
  • 1
    Can you show a reproducible example with desired output? My suspicion that you are looking foe something like [this](http://stackoverflow.com/questions/30913338/join-two-data-tables-and-use-only-one-column-from-second-dt-in-r/30914530#30914530) – David Arenburg Jun 19 '15 at 10:13
  • Yes, this is very close to what I'm looking for. One specific - is it necessary to setkey(dt1) to achieve this? Setting the key would reorder data.table in memory, right? – Sergii Zaskaleta Jun 19 '15 at 10:59
  • Yes, but setting a key is very efficient and takes less than a second even on a 1e8 rows data set. More over, you are not losing the key after the binary join, for example `DT <- data.table(A = letters[1:4]) ; DT2 <- data.table(A = letters[1:3]) ; setkey(DT, A) ; DT[DT2, B := i.A] ; key(DT)`. So you can keep joining to other too tables after you set it only once. – David Arenburg Jun 19 '15 at 11:26
  • Syntax mentioned by David for `t1[t2, col1 := i.col1]` should allows you to NOT copy entire `t1` at all. You can self-answer your question if it solves your problem. This may also helps when implemented: [data.table#1130](https://github.com/Rdatatable/data.table/issues/1130) – jangorecki Jun 19 '15 at 16:58
  • Also nd the `dt1[dt2, col1 := i.col1]` _does not_ create the columns. i.e. there are no `col1` when i run `ls(dt1)` after the join. Are there some other requirements for maybe keying t2? – Sergii Zaskaleta Jun 22 '15 at 09:02
  • @SergiiZaskaleta on no-match there is no new column, this is an open issue [data.table#1166](https://github.com/Rdatatable/data.table/issues/1166) – jangorecki Jun 23 '15 at 22:05
  • @jangorecki but there _is_ a match - both `J(..)` and `t1 <- t2[t1]` approaches produce the desired output. – Sergii Zaskaleta Jun 24 '15 at 09:05
  • @SergiiZaskaleta consider accepting an answer or providing feedback why it didn't address your question. The issue mentioned by me before is already fixed. – jangorecki Apr 29 '16 at 22:08

1 Answers1

1

Since the devel version of data.table, v1.9.5 recently implemented a feature of being able to join data.tables directly without having to set keys using the new on argument, I can provide a quality answer to your question:

t1[t2, colT2 := i.colT2, on = c(key2="key2")
   ][t3, colT3 := i.colT3, on = c(key3="key3")
     ][]

It should also preserve the original key on your t1.

Arun
  • 116,683
  • 26
  • 284
  • 387
jangorecki
  • 16,384
  • 4
  • 79
  • 160