5

Morning everyone

In data.table I found that with a left join, when mentioning a column name implicitly i.e. without mentioning the table (in which the column resides in) induces unexpected results despite unique column names.

dummy data

x <- data.table(a = 1:2); x
#    a
# 1: 1
# 2: 2
y <- data.table(c = 1
                ,d = 2); y 
#    c d
# 1: 1 2

left join without mentioning table name in retrieve of column c

z <- y[x, on=.(c=a), .(a,c,d)];  z
#    a c  d
# 1: 1 1  2
# 2: 2 2 NA

Problem arises when looking at results above. Row 2 of column c is supposed to be NA. However, it shows 2 This is only rectified when the user explicitly mentions the table:

z <- y[x, on=.(c=a), .(a,x.c,d)];  z
#    a x.c  d
# 1: 1   1  2
# 2: 2  NA NA

It is perhaps worth mentioning the x in x.c is referring to the position of syntax x[i], in this case, table y

My question is why is the explicit mention of table necessary for a task seemingly basic. Or am I missing something? Thank you.

Sweepy Dodo
  • 1,761
  • 9
  • 15
  • As I see it the result in the first join is perfectly correct as the columns to merge on are interpreted as being the same even thought they have different names. It seems to me that more than a join you are looking for something like `cbind.fill(x, y, fill = NA)` from library `rowr`. – Flavia Jul 03 '19 at 09:26
  • 4
    [This issue on GitHub gives additional information on this behavior](https://github.com/Rdatatable/data.table/pull/2706). – Jaap Jul 03 '19 at 09:37
  • 3
    From a highly recommended [tutorial by @Frank](http://franknarf1.github.io/r-tutorial/_book/), [chapter on equi-joins](http://franknarf1.github.io/r-tutorial/_book/tables.html#equi-joins): "When merging on columns with different names, they must be written in `on=` like `x = y` where `x` is from the “left” table, and `y` from the “right” one. Because we are using `i` to lookup rows in `x`, **the displayed column will have its name from `x` and its values from `i`**. – Henrik Jul 03 '19 at 09:47
  • 3
    [Related SO post by @Arun](https://stackoverflow.com/a/44343424/1851712): "In data.table, joins of the form `x[i]` traditionally uses **values from `i` but uses column names from `x`**" – Henrik Jul 03 '19 at 09:49
  • 2
    @Jaap@Henrik. Thank you for pointing out the default working of keeping values from i but name from x and the links to other questions and github. I hate to have asked a duplicate but really did not come across these links. Should have looked harder. Look forward to the next update as I am interested in data.table for performance gain – Sweepy Dodo Jul 03 '19 at 10:00

0 Answers0