74

Suppose I have two data.table's:

dataA:

   A  B
1: 1 12
2: 2 13
3: 3 14
4: 4 15

dataB:

   A  B
1: 2 13
2: 3 14

and I have the following code:

merge_test = merge(dataA, dataB, by="A", all.data=TRUE)

I get:

   A B.x B.y
1: 2  13  13
2: 3  14  14

However, I want all the rows in dataA in the final merged table. Is there a way to do this?

Ofek Shilon
  • 14,734
  • 5
  • 67
  • 101
lord12
  • 2,837
  • 9
  • 35
  • 47
  • 7
    A search should result in a number of questions that cover this. Here is one: http://stackoverflow.com/questions/12773822/why-does-xy-join-of-data-tables-not-allow-a-full-outer-join-or-a-left-join – mrp Jan 04 '16 at 19:18
  • 5
    If you want to do a left join, you can use `all.x = TRUE`. If you want to do a full outer join, you can use `all = TRUE`. – ytk Jan 04 '16 at 19:19
  • 4
    Judging from votes, maybe consider changing accepted answer? – zx8754 Feb 05 '18 at 08:51

3 Answers3

180

If you want to add the b values of B to A, then it's best to join A with B and update A by reference as follows:

A[B, on = 'a', bb := i.b]

which gives:

> A
   a  b bb
1: 1 12 NA
2: 2 13 13
3: 3 14 14
4: 4 15 NA

This is a better approach than using B[A, on='a'] because the latter just prints the result to the console. When you want to get the results back into A, you need to use A <- B[A, on='a'] which will give you the same result.

The reason why A[B, on = 'a', bb := i.b] is better than A <- B[A, on = 'a'] is memory efficiency. With A[B, on = 'a', bb := i.b] the location of A in memory stays the same:

> address(A)
[1] "0x102afa5d0"
> A[B, on = 'a', bb := i.b]
> address(A)
[1] "0x102afa5d0"

While on the other hand with A <- B[A, on = 'a'], a new object is created and saved in memory as A and hence has another location in memory:

> address(A)
[1] "0x102abae50"
> A <- B[A, on = 'a']
> address(A)
[1] "0x102aa7e30"

Using merge (merge.data.table) results in a similar change in memory location:

> address(A)
[1] "0x111897e00"
> A <- merge(A, B, by = 'a', all.x = TRUE)
> address(A)
[1] "0x1118ab000"

For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:

A[B, on = 'a', bb := i.b] 

Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

Probably also worth mentioning is that the order of A stays the same.


To see the effect on speed and memory use, let's benchmark with some larger datasets (for data, see the 2nd part of the used data-section below):

library(bench)
bm <- mark(AA <- BB[AA, on = .(aa)],
           AA[BB, on = .(aa), cc := cc],
           iterations = 1)

which gives (only relevant measurements shown):

> bm[,c(1,3,5)]
# A tibble: 2 x 3
  expression                         median mem_alloc
  <bch:expr>                       <bch:tm> <bch:byt>
1 AA <- BB[AA, on = .(aa)]            4.98s     4.1GB
2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms   384.6MB

So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.

NOTE: Gains in speed and memory use might differ in different setups.


Used data:

# initial datasets
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# large datasets for the benchmark
set.seed(2019)
AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e7, TRUE))
BB <- data.table(aa = sample(AA$a, 2e5), cc = sample(2:8, 2e5, TRUE))
Frank
  • 66,179
  • 8
  • 96
  • 180
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • 8
    Great answer. Just to confirm, I assume the "i" in "A[B, bb:=i.b, on='a']" refers to the"i" in the general data.table "DT[i, j, by]" syntax? – cbailiss May 14 '17 at 07:18
  • 10
    @cbailiss Yes, `i.b` mean that in updating `A` with the join it should look t the `b`-column of `B`. In a similar way, with the `x.` prefix you can refer to columns of `A`. – Jaap May 14 '17 at 09:11
  • 5
    @Jaap How would you manage the join by reference when there are multiple new columns created? Here the new column `bb := i.b` is created, which as you stated looks up the corresponding `b` column value in the `B` `data.table` corresponding to `i`. But what happens when you have many new columns that would potentially be created from merging (by reference) larger `data.tables`? – Prevost Nov 02 '18 at 14:43
  • 4
    @Prevost [see here for an example](https://stackoverflow.com/a/37994369/2204410), I hope that answers your question – Jaap Nov 02 '18 at 15:13
  • 1
    @Jaap It does answer it, thank you. I was hoping for a less typing intensive way, but I guess the trade off is typing for a by reference update. If I have been piping `data.table` operations to create columns I typically pipe in a `merge()` where available, which avoids assigning a new `data.table`. But with smaller `data.tables` or where a new `data.table` must be created from a join I may use your answer above or stick with `merge()`. Thanks for the quick comment! – Prevost Nov 02 '18 at 16:38
  • 4
    @Prevost the tric is in using `mget`, see also the last part of my answer under the link from my previous comment – Jaap Nov 02 '18 at 20:09
  • 2
    I've never seen this `i.` and `x.` syntax from `data.table` documentation before. Great answer – Matthew Son Nov 03 '20 at 13:44
23

You can try this:

# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a') 

B[A]
Jaap
  • 81,064
  • 34
  • 182
  • 193
danas.zuokas
  • 4,551
  • 4
  • 29
  • 39
  • This answer works fine if one data table key is subset of the other. Is there possibility to join if they intersect partially? For example if `A, B` are like: `A <- data.table(a = 1:4, b = 12:15) B <- data.table(a = 2:5, c = 13:16) ` – Lodyk Vovchak May 28 '20 at 09:48
2

For the sake of completeness, I add the table.express version of an answer to your questions. table.express nicely extends the tidyverse language to data.table making it a handy tool to work fastly with huge datasets. Here is the solution using your datasets from the question above:

merge_test = dataA %>% left_join(dataB, by="A")

A left_join keeps all rows from dataA in the joined dataset.

Note: You must load the packages data.table and table.express.

ToWii
  • 590
  • 5
  • 8
  • 1
    this is great! thanks - I had not heard of this package. – nate-m Mar 13 '22 at 12:42
  • 1
    Can you explan how this works? Within tidyverse I could do `left_join(table1, table2, by = c("colname" = "colname")`. This doesn't work with table.express. I need an `on =` argument. However, I can't find any in the documentation. – gernophil Jun 30 '22 at 12:40
  • @gernophil please have a look at the [join documentation](https://asardaes.github.io/table.express/articles/joins.html) for `table.express`. The second argument in `left_join`, which is `by="A"` above, is the `on =` argument you refer to. Optionally, you can use the variable name without quotes. – ToWii Jul 03 '22 at 10:50