Which data.table syntax for left join (one column) to prefer

Question

How should I start thinking about which syntax I prefer?

My criteria is efficiency (this is number one) and also readability/maintainability.

This

A <- B[A, on = .(id)] # very concise!

Or that

A[B, on = .(id), comment := i.comment]

Or even (as PoGibas suggests):

A <- merge(A, B, all.x = TRUE)

For completeness then a more basic way is to use match():

A[, comment := B[chmatch(A[["id"]], id), comment]]

Example data:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

If you want to assign the column in place, use the second method — akrun, Jan 22 '19 at 16:11
You are assigning the output to the identifier 'A and the memory location should be different — akrun, Jan 22 '19 at 16:13
We don't know what criteria you would use to prefer one over another. You can read the package vignettes if you want to hear the case for assign/update-by-reference. That's the only join that ever shows up in my workflow, usually like `A[, comment := B[.SD, on=.(id), x.comment]]` — Frank, Jan 22 '19 at 16:23
@Frank Thanks, I didn't know you could use `.SD` like that! just edited with some words on my criteria — s_baldur, Jan 22 '19 at 16:30

Frank · Accepted Answer · 2019-06-04T15:07:32.933

I prefer the "update join" idiom for efficiency and maintainability:**

DT[WHERE, v := FROM[.SD, on=, x.v]]

It's an extension of what is shown in vignette("datatable-reference-semantics") under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.

This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].

It makes my code more readable since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.

Compared with the other update join approach DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"), 
  ord = 1:4, 
  comment = c("big", "slow", "nice", "nooice")
)

# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu) 
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: comment,i.comment 
# Assigning to 4 row subset of 10 rows

# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
#   Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)

    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>

In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative -- it says there are four rows being updated but doesn't say that one row is being updated twice.

I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.

** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.

Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row

Another related example: Left join using data.table

Just an addendum: The `on=` part must be filled with the join column names (eg. `on=.(id)`. PS: Excellent answer! — R Yoda, Jan 22 '19 at 19:32
Thanks for your answer! Two comments/questions: (1) I do see the beauty of your method but does it improve over the second original option (`v := i.v`)? (2) The SQL jargon, isn't that bigger than just SQL [...wiki/Relational_algebra](https://en.wikipedia.org/wiki/Relational_algebra)? I would argue it is good to have some set of consistent vocabulary (jargon?) that one can use to communicate in places like this. — s_baldur, Jan 23 '19 at 08:06
@snoram For (1), I edited to add a couple reasons. For (2), I guess it is partly an issue of preferences and background. With a normalized set of tables, I have never found a need for the various SQL joins (outer, inner, full, left, right, etc) or the lingo behind them... and sort of suspect that there's something inefficient in a workflow that requires them (outside of data cleaning). I have some familiarity with set theory and like the relational algebra terms, though (union, semi-, equi-, anti-joins, etc.) — Frank, Jan 23 '19 at 18:03
@Frank I think your answer is a really good resource on joins with data.table. Recently updated to version 1.12.3 and now I get an error when I use `A[, comment2 := B2[A, on=.(id), x.comment]]` saying: _Supplied 11 items to be assigned to 10 items of column 'comment2'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code._ — markus, May 15 '19 at 09:45
@markus Thanks :) That error happens if B2 has duplicate rows for some ids. You can use `mult="first"` or `mult="last"` to select a value that way or otherwise calculate the selected value by some function like `A[, comment2 := B2[A, on=.(id), max(x.comment), by=.EACHI]$V1]`. Let me know if it doesn't make sense — Frank, May 15 '19 at 13:14
I like this approach and I see the upsides. But one still has to `A[!is.na(comment2), comment := comment2]` to update the comment column compared to the classic update-join, no? — alodi, Oct 11 '22 at 09:07
@alodi After seeing the `Supplied 11 items to be assigned to 10 items` warning (maybe now an error?), I would take a look at `B` again and see that it cannot be used to map from id -> comment and then decide how to fix things from there (eg, maybe drop dupes in B and try again). — Frank, Oct 11 '22 at 18:12
For all the great advice given in this answer, this answer is also misleading to new data.table users. OP asks for advice on left join but this answer does *not* result in a typical left join. When there are multiple matches in B, it will not return all matches. You address this indirectly in your second example and by referring to this operation as "left-joiny" instead of "left-join". I would really recommend explicitly saying this is not performing as SQL-style left join, though. — Eli, Oct 16 '22 at 18:20

Which data.table syntax for left join (one column) to prefer

1 Answers1

Linked