1

Consider some sample data

library(data.table)
dt1 <- data.table(foo = 1:4, bar = letters[1:4])
dt2 <- data.table(foo1 = 2:5, bar1 = LETTERS[1:4])

I am trying to join these two sample data.tables by non-equi join (which also needs me to set cartesian join option):

options("datatable.allow.cartesian" = T)
dt3 <- dt1[dt2, on = .(foo < foo1), nomatch = 0L]

dt3 comes out to be:

    foo bar bar1
 1:   2   a    A
 2:   3   a    B
 3:   3   b    B
 4:   4   a    C
 5:   4   b    C
 6:   4   c    C
 7:   5   a    D
 8:   5   b    D
 9:   5   c    D
10:   5   d    D

However there are 2 problems with this:

  1. column foo don't have value 5 (thats an obvious bug)
  2. column foo1 is not present in output (which could be limiting for further conditions; if any)

In order to bypass second condition I tried:

dt2[, foo11 := foo1]
dt4 <- dt1[dt2, on = .(foo < foo1), nomatch = 0L]
options("datatable.allow.cartesian" = F)

which gives me dt4:

    foo bar bar1 foo11
 1:   2   a    A     2
 2:   3   a    B     3
 3:   3   b    B     3
 4:   4   a    C     4
 5:   4   b    C     4
 6:   4   c    C     4
 7:   5   a    D     5
 8:   5   b    D     5
 9:   5   c    D     5
10:   5   d    D     5

So here foo11 is essentially a copy of errorneous column foo which looks like another bug due to non-equi join.

Am I missing any point or doing something wrong here?

abhiieor
  • 3,132
  • 4
  • 30
  • 47

1 Answers1

3

I think the behavior is the expected, except for the variable being named quite surprisingly. I made a little tweak to your sample data to show that everything is alright:

dt1 <- data.table(foo = 1:4, bar = letters[1:4])
dt2 <- data.table(foo1 = 2:4, bar1 = letters[2:4]) # small change here

dt1[dt2, on = .(foo < foo1), allow.cartesian = TRUE][dt1, on = "bar"]
    foo bar bar1 i.foo
 1:   2   a    b     1
 2:   3   a    c     1
 3:   4   a    d     1
 4:   5   a    e     1
 5:   3   b    c     2
 6:   4   b    d     2
 7:   5   b    e     2
 8:   4   c    d     3
 9:   5   c    e     3
10:   5   d    e     4

For me the behavior is the expected, it's just that the first column is named foo instead of foo1. That's why you see fool1 as an erroneous copy of "erroneous foo: it's actually a copy offoo1`.

EDIT: a possible workaround:

This isn't really elegant, but is a workaround:

dt1[dt2, .(foo = x.foo, foo1, bar, bar1), on = .(foo < foo1), allow.cartesian = TRUE]
    foo foo1 bar bar1
 1:   1    2   a    b
 2:   1    3   a    c
 3:   2    3   b    c
 4:   1    4   a    d
 5:   2    4   b    d
 6:   3    4   c    d
 7:   1    5   a    e
 8:   2    5   b    e
 9:   3    5   c    e
10:   4    5   d    e

x.foo retains the original, true foo. foo1 still is what it is, so you can return both variables.

PavoDive
  • 6,322
  • 2
  • 29
  • 55
  • 2
    There's an open issue regarding this naming: https://github.com/Rdatatable/data.table/issues/3437. As per the last release notes and news, the reactions to issues are being considered for prioritization of solutions, so you may consider giving it the "like" (top right corner of the original issue). – PavoDive Oct 11 '19 at 09:44
  • 3
    See also [SQL-like column return for non-equi and rolling joins](https://github.com/Rdatatable/data.table/pull/2706) – Henrik Oct 11 '19 at 09:47
  • Thanks a lot for answering as it unblocks me. Also pointing to the issue which I will like (Along with your answer :)). I like your edit more as it is cleaner with @Henrik link. Original answer to me looks more like a workaround in which you are joining table again. – abhiieor Oct 11 '19 at 09:58