4

I'm wondering why data.table's left join is not giving me multiple matches, it seems there is some weird built in "no duplicates", which does not really make it a left join, does it?

data:

test=data.table(mtcars[1,])
test2=data.table(mtcars[c(1,1),])

data.table:

test[test2, on = c(carb = "carb"), wt2 := i.wt]

dplyr:

test %>% left_join(test2 %>% select(carb, wt) %>% rename(wt2 = wt), 
                       by = "carb")

dplyr gives me the correct two row result, but data.table gives me only one row. What is going on here?

Helen
  • 533
  • 12
  • 37
  • shouldn't it be `test2[test, on = c(carb = "carb"), wt2 := i.wt]` – AnilGoyal Jan 23 '21 at 07:06
  • @AnilGoyal: No, not in this example, no. The dplyr join is equal to left_join(test, test2), same with the data.table one. – Helen Jan 23 '21 at 07:07
  • See section 13.4.4 of [R4DS](https://r4ds.had.co.nz/relational-data.html) (by author of tidyverse) it explains that tidyr evaluates many to many relationships with cartesian product i.e. all possible combinations. I am not much aware with data.table but seems it may behave differently for such situations – AnilGoyal Jan 23 '21 at 07:20
  • @AnilGoyal Well, tidyr/tidyverse/dplyr using left_join works well. this is indeed a left join, so that is not the issue as I see it. dplyr works as intended. it is the data.table that does not work as a left join, yet it is claimed that this is the left join of data.table, which is just wrong – Helen Jan 23 '21 at 07:22
  • e.g. see answer from user @Jaap here: https://stackoverflow.com/a/34600831 – Helen Jan 23 '21 at 07:23

2 Answers2

5

Reference the answer of @Frank here: Which data.table syntax for left join (one column) to prefer

, and the answer of @Jaap here: left join using data.table

I think this is dangerous territory for somebody actually looking for a left join, and coming to this community for answers. The update by reference examples (solutions using :=) provided in the links (same method used by OP here) is not really a left join. The left join returns all records from the left table, and the matched records from the right table, but this data.table syntax we are discussing does not actually return all the matched records from the right table, hence not being a left join.

The data.table documentation clearly says that a left join is:

X[DT, on="x"] # left join

, which will make the correct data.table OP is looking:

test[test2, on="carb"] # or,
test[test2[, wt2 := wt][,c("carb", "wt2")], on="carb"] # left join

@Jaap explains in his answer the difference between joining by reference (using the := notation) and the regular left join:

Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

There are two statements here that are scary, in my eyes. One is that there is indeed a noticeable difference - in results - because the results using the two methods are different (although I understand he is talking about speed, edit: as @jangorecki points out, it is not about speed, but memory usage. Updating by reference does not create a second object in memory). The second thing is the assumption (in my eyes, respectfully) is that if one is working with large datasets and the intent is to do a left join, I have never met anybody that does not want all the matches in the right table.

I’m unaware of how can one actually do a left join using updating by reference in ‘data.table’. My reasoning is that := is defined for use in j only (data.table syntax). It adds or updates or removes column(s) by reference. It makes no copies of any part of memory at all. But since we potentially have to add additional rows to the left table if we get multiple matches in the right table (if we are intending to do an actual left join), si don’t think the := operator can be used, because it works on columns

Helen
  • 533
  • 12
  • 37
  • if you have a question you should make it a question (either here on gh repo) rather than an answer here. – jangorecki Jan 23 '21 at 11:15
  • 1
    @jangorecki that is your reason for downvoting? After what is an actual attempt of putting light on an important issue? I would much rather just point that out politely, so I can remove it – Helen Jan 23 '21 at 11:25
  • This is not the reason for downvoting. Good answer should explain why update-on-join (using :=) cannot have multiple matches, from the logical point of view (not specific to DT, same happens for sqlite). Otherwise users will not understand why it is like this and they will only try to remember which interface to use when (specific to DT). I am looking forward to upvote but for such a good question I expect good answer. – jangorecki Jan 23 '21 at 11:34
  • 1
    @jangorecki fair enough, I’ll try to explain better, thanks for pointing that out – Helen Jan 23 '21 at 11:36
  • 1
    https://github.com/Rdatatable/data.table/issues/3747 this can be helpful and https://rdatatable.gitlab.io/data.table/library/data.table/html/assign.html explains this behavior as well – jangorecki Jan 23 '21 at 11:38
  • @jangorecki: I've tried to explain (as far as I know) why it will be hard to do an actual left join by reference. not 100% if it is correct, but I've tried. If you and other think my question are actually good, why is the community not upvoting my original question? I'm not saying I have the answer here, I'm just curious here, and even though I might be offending some of the answers I've linked to, well, my apologies, but I would much rather have a good discussion where people could explain to me where my reasoning is flawed. Thanks for the links, I'll take a look! – Helen Jan 23 '21 at 11:50
  • Thanks for the links @jangorecki , I did not know this has been discussed before, appreciate it! – Helen Jan 23 '21 at 11:58
  • Speed is not really important in update-on-join but the memory is. You may not be able to do left join when you are low in memory, but may be still able to do update-on-join. You don't need to allocate results as new table when doing update-on-join. Left join on the other side creates new data as results, thus it can also extend number of rows (in case of duplicate matches). Even if there are no duplicate matches it still generate new data so needs more memory. – jangorecki Jan 24 '21 at 09:52
  • @Helen, your question/answer was **very** helpful. Until I read your answer, I didn't realize that the X[DT, var. := i.var] syntax doesn't return a left join. I see the logic of why you can't perform a left-join by reference *now*, but I didn't think about it before. I agree that those two answers you reference in your post (and others! it's a widespread misconception) are dangerous; I found those answers before yours and completely misunderstood left-join vs update-on-join. – Eli Oct 14 '22 at 16:15
2

The equivalent of the dplyr join

test %>% left_join(test2 %>% select(carb, wt) %>% rename(wt2 = wt), 
                       by = "carb")

in data.table is

test[test2[, .(carb, wt2 = wt)], on = 'carb']
pseudospin
  • 2,737
  • 1
  • 4
  • 19