16

Just wondering if there is an efficient way to do outer joins with data table such as

a <- data.table(a=c(1,2,3),b=c(3,4,5))
b <- data.table(a=c(1,2),k=c(1,2))
merge(a,b,by="a",all.x=T)

this works fine, but it is not as efficient as the inner join with bigger data, as the following runs very fast, but the above is really slow.

setkey(a,a)
setkey(b,a)
a[b,]
jamborta
  • 5,130
  • 6
  • 35
  • 55
  • In the first case, `a` and `b` are unkeyed so `merge` will need to key them first (as local copies (kind of) inside merge because it doesn't want to change `a` and `b` in calling scope). In the second case you've been happy to change `a` and `b` by keying them (did you include the time to do that?) and then `a[b]` is fast. But even so I'm suprised there's a large difference. `merge` _should_ be fairly comparable to `x[y]`. Please state version info when talking about timings: are you on v1.8.6? And also your "very fast" and "very slow" might be my idea of "similar"! What are the actual times? – Matt Dowle Nov 21 '12 at 13:50
  • It's very easy to benchmark badly/inappropriately, so we definitely need to see your method of timing before saying anything at all. – Matt Dowle Nov 21 '12 at 13:56
  • I couldn't provide time for this as the first one exploded in memory and crashed the R session (joining around 19m lines). I'll benchmark it with a smaller set and post the results. (version 1.8.2, I'm using) – jamborta Nov 21 '12 at 17:23
  • That would be great on a smaller set. We often see users reporting `merge` exploding (base merge as well as data.table merge) when a cartesian join is accidentally requested. Perhaps we can put some traps in there to help detect and catch incorrect usage. Just a guess. It seems that people sometimes try to use `merge` when they actually need `cbind`. – Matt Dowle Nov 21 '12 at 17:46

1 Answers1

11

b[a,] is the "outer join" you're looking for.

Take a look at ?merge.data.table for more specifics.

Justin
  • 42,475
  • 9
  • 93
  • 111
  • 1
    thanks! so a[b,] or b[a,] is essentially an left join (in SQL terms)? I always thought about it as an inner join. – jamborta Nov 21 '12 at 13:41
  • @jamborta See FAQ 2.16 (`nomatch = 0|NA`) – Matt Dowle Nov 21 '12 at 13:46
  • 2
    thanks Matthew, that explains it. I assume that this way you cannot do full outer join (only left outer and inner)? – jamborta Nov 21 '12 at 17:19
  • 2
    @jamborta Correct. But you can do `X[Y]` or `Y[X]` for left or right. It derives from the _by-without-by_ feature (similar to a CROSS APPLY and OUTER APPLY in SQL). The idea is that `j` runs for each row of `i`. And this doesn't make sense in the full outer join context. If full outer join is needed then that's what `merge` can do. Having said that, it is coming up and more join types have been requested [in this detailed question](http://stackoverflow.com/questions/12773822/why-does-xy-join-of-data-tables-not-allow-a-full-outer-join-or-a-left-join), which is on the list to consider. – Matt Dowle Nov 21 '12 at 17:41
  • Is there any faster alternative to ```merge.data.table``` for full outer join? – Moysey Abramowitz Oct 23 '21 at 17:14