Left join using data.table

Question

Suppose I have two data.table's:

dataA:

dataB:

   A  B
1: 2 13
2: 3 14

and I have the following code:

merge_test = merge(dataA, dataB, by="A", all.data=TRUE)

I get:

   A B.x B.y
1: 2  13  13
2: 3  14  14

However, I want all the rows in dataA in the final merged table. Is there a way to do this?

A search should result in a number of questions that cover this. Here is one: http://stackoverflow.com/questions/12773822/why-does-xy-join-of-data-tables-not-allow-a-full-outer-join-or-a-left-join — mrp, Jan 04 '16 at 19:18
If you want to do a left join, you can use `all.x = TRUE`. If you want to do a full outer join, you can use `all = TRUE`. — ytk, Jan 04 '16 at 19:19
Judging from votes, maybe consider changing accepted answer? — zx8754, Feb 05 '18 at 08:51

score 180 · Answer 1 · edited Jun 28 '19 at 00:22

If you want to add the b values of B to A, then it's best to join A with B and update A by reference as follows:

A[B, on = 'a', bb := i.b]

which gives:

> A
   a  b bb
1: 1 12 NA
2: 2 13 13
3: 3 14 14
4: 4 15 NA

This is a better approach than using B[A, on='a'] because the latter just prints the result to the console. When you want to get the results back into A, you need to use A <- B[A, on='a'] which will give you the same result.

The reason why A[B, on = 'a', bb := i.b] is better than A <- B[A, on = 'a'] is memory efficiency. With A[B, on = 'a', bb := i.b] the location of A in memory stays the same:

> address(A)
[1] "0x102afa5d0"
> A[B, on = 'a', bb := i.b]
> address(A)
[1] "0x102afa5d0"

While on the other hand with A <- B[A, on = 'a'], a new object is created and saved in memory as A and hence has another location in memory:

> address(A)
[1] "0x102abae50"
> A <- B[A, on = 'a']
> address(A)
[1] "0x102aa7e30"

Using merge (merge.data.table) results in a similar change in memory location:

> address(A)
[1] "0x111897e00"
> A <- merge(A, B, by = 'a', all.x = TRUE)
> address(A)
[1] "0x1118ab000"

For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:

A[B, on = 'a', bb := i.b]

Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

Probably also worth mentioning is that the order of A stays the same.

To see the effect on speed and memory use, let's benchmark with some larger datasets (for data, see the 2nd part of the used data-section below):

library(bench)
bm <- mark(AA <- BB[AA, on = .(aa)],
           AA[BB, on = .(aa), cc := cc],
           iterations = 1)

which gives (only relevant measurements shown):

> bm[,c(1,3,5)]
# A tibble: 2 x 3
  expression                         median mem_alloc
  <bch:expr>                       <bch:tm> <bch:byt>
1 AA <- BB[AA, on = .(aa)]            4.98s     4.1GB
2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms   384.6MB

So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.

_{NOTE: Gains in speed and memory use might differ in different setups.}

Used data:

# initial datasets
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# large datasets for the benchmark
set.seed(2019)
AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e7, TRUE))
BB <- data.table(aa = sample(AA$a, 2e5), cc = sample(2:8, 2e5, TRUE))

Great answer. Just to confirm, I assume the "i" in "A[B, bb:=i.b, on='a']" refers to the"i" in the general data.table "DT[i, j, by]" syntax? — cbailiss, May 14 '17 at 07:18
@cbailiss Yes, `i.b` mean that in updating `A` with the join it should look t the `b`-column of `B`. In a similar way, with the `x.` prefix you can refer to columns of `A`. — Jaap, May 14 '17 at 09:11
@Jaap How would you manage the join by reference when there are multiple new columns created? Here the new column `bb := i.b` is created, which as you stated looks up the corresponding `b` column value in the `B` `data.table` corresponding to `i`. But what happens when you have many new columns that would potentially be created from merging (by reference) larger `data.tables`? — Prevost, Nov 02 '18 at 14:43
@Prevost [see here for an example](https://stackoverflow.com/a/37994369/2204410), I hope that answers your question — Jaap, Nov 02 '18 at 15:13
@Jaap It does answer it, thank you. I was hoping for a less typing intensive way, but I guess the trade off is typing for a by reference update. If I have been piping `data.table` operations to create columns I typically pipe in a `merge()` where available, which avoids assigning a new `data.table`. But with smaller `data.tables` or where a new `data.table` must be created from a join I may use your answer above or stick with `merge()`. Thanks for the quick comment! — Prevost, Nov 02 '18 at 16:38
@Prevost the tric is in using `mget`, see also the last part of my answer under the link from my previous comment — Jaap, Nov 02 '18 at 20:09
I've never seen this `i.` and `x.` syntax from `data.table` documentation before. Great answer — Matthew Son, Nov 03 '20 at 13:44

score 23 · Accepted Answer · edited Dec 18 '18 at 11:54

23

You can try this:

# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a') 

B[A]

edited Dec 18 '18 at 11:54

Jaap

81,064
34
182
193

answered Jan 04 '16 at 19:21

danas.zuokas

4,551
4
29
39

This answer works fine if one data table key is subset of the other. Is there possibility to join if they intersect partially? For example if `A, B` are like: `A <- data.table(a = 1:4, b = 12:15) B <- data.table(a = 2:5, c = 13:16) ` – Lodyk Vovchak May 28 '20 at 09:48

ToWii · Answer 3 · 2021-12-24T07:59:42.873

2

For the sake of completeness, I add the table.express version of an answer to your questions. table.express nicely extends the tidyverse language to data.table making it a handy tool to work fastly with huge datasets. Here is the solution using your datasets from the question above:

merge_test = dataA %>% left_join(dataB, by="A")

A left_join keeps all rows from dataA in the joined dataset.

Note: You must load the packages data.table and table.express.

edited Dec 24 '21 at 07:59

answered Dec 07 '20 at 12:28

ToWii

590
5
8

1

this is great! thanks - I had not heard of this package. – nate-m Mar 13 '22 at 12:42
1

Can you explan how this works? Within tidyverse I could do `left_join(table1, table2, by = c("colname" = "colname")`. This doesn't work with table.express. I need an `on =` argument. However, I can't find any in the documentation. – gernophil Jun 30 '22 at 12:40
@gernophil please have a look at the [join documentation](https://asardaes.github.io/table.express/articles/joins.html) for `table.express`. The second argument in `left_join`, which is `by="A"` above, is the `on =` argument you refer to. Optionally, you can use the variable name without quotes. – ToWii Jul 03 '22 at 10:50

Left join using data.table

3 Answers3

Linked

Related