Reference the answer of @Frank here:
Which data.table syntax for left join (one column) to prefer
, and the answer of @Jaap here:
left join using data.table
I think this is dangerous territory for somebody actually looking for a left join, and coming to this community for answers. The update by reference examples (solutions using :=
) provided in the links (same method used by OP here) is not really a left join. The left join returns all records from the left table, and the matched records from the right table, but this data.table
syntax we are discussing does not actually return all the matched records from the right table, hence not being a left join.
The data.table documentation clearly says that a left join is:
X[DT, on="x"] # left join
, which will make the correct data.table OP is looking:
test[test2, on="carb"] # or,
test[test2[, wt2 := wt][,c("carb", "wt2")], on="carb"] # left join
@Jaap explains in his answer the difference between joining by reference (using the :=
notation) and the regular left join:
Although this doesn't make a noticeable difference with small datasets
like these, it does make a difference on large datasets for which
data.table was designed.
There are two statements here that are scary, in my eyes. One is that there is indeed a noticeable difference - in results - because the results using the two methods are different (although I understand he is talking about speed, edit: as @jangorecki points out, it is not about speed, but memory usage. Updating by reference does not create a second object in memory). The second thing is the assumption (in my eyes, respectfully) is that if one is working with large datasets and the intent is to do a left join, I have never met anybody that does not want all the matches in the right table.
I’m unaware of how can one actually do a left join using updating by reference in ‘data.table’. My reasoning is that :=
is defined for use in j only (data.table
syntax). It adds or updates or removes column(s) by reference. It makes no copies of any part of memory at all. But since we potentially have to add additional rows to the left table if we get multiple matches in the right table (if we are intending to do an actual left join), si don’t think the :=
operator can be used, because it works on columns