6

I'm sure I'm overlooking the obvious, but I can't find a way to join all the columns of the "lookup" table on a data.table non-equi join in one single step.

I looked at Arun's presentation (https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanSatRdaysBudapest2016.pdf) and multiple SO questions, but nearly all of them only deal with updating a single column, not joining multiple.

Suppose I have 2 data.tables a and b:

library(data.table)
a <- data.table(Company_ID = c(1,1,1,1),
            salary = c(2000, 3000, 4000, 5000))

#   Company_ID salary
# 1:          1   2000
# 2:          1   3000
# 3:          1   4000
# 4:          1   5000

b <- data.table(cat = c(1,2),
            LB = c(0, 3000),
            UB = c(3000,5000),
            rep = c("Bob","Alice"))

#    cat   LB   UB   rep
# 1:   1    0 3000   Bob
# 2:   2 3000 5000 Alice

What I want in the end is matching the cat, LB, UB, rep (all cols in b) to table a:

#    Company_ID salary cat   LB   UB   rep
# 1:          1   2000   1    0 3000   Bob
# 2:          1   3000   2 3000 5000 Alice
# 3:          1   4000   2 3000 5000 Alice

Currently, I the only way I manage to do it is with the following two lines:

a <- a[b, on = .(salary >= LB, salary < UB), cat := cat]
a[b, on = .(cat == cat)]

Which outputs the desired table, but seems cumbersome and not at all like a data.table approach. Any help would be greatly appreciated!

Frank
  • 66,179
  • 8
  • 96
  • 180
bendae
  • 779
  • 3
  • 13
  • 1
    Side note: You should / don't need to do `a <- a[..., ... := ...]` since the `:=` operator modifies `a` by reference. – Frank Jan 13 '17 at 15:53
  • 1
    that is certainly correct, thank you - a habit of the normal join X[Y] syntax, I guess – bendae Jan 13 '17 at 16:00

2 Answers2

8

Since you want results for every row of a, you should do a join like b[a, ...]:

b[a, on=.(LB <= salary, UB > salary), nomatch=0, 
  .(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]

   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice
  • nomatch=0 means we'll drop rows of a that are unmatched in b.
  • We need to explicitly refer to the UB and LB columns from b using the x.* prefix (coming from the ?data.table docs, where the arguments are named like x[i]).

Regarding the strange default cols, there is an open issue to change that behavior: #1615.


(Issue #1989, referenced below, is fixed now -- See Uwe's answer.)

Alternately... One way that should work and avoids explicitly listing all columns: add a's columns to b, then subset b:

b[a, on=.(LB <= salary, UB > salary), names(a) := mget(paste0("i.", names(a)))] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

There are two problems with this. First, there's a bug causing non-equi join to break when confronted with mget (#1989). The temporary workaround is to enumerate a's columns:

b[a, on=.(LB <= salary, UB > salary), `:=`(Company_ID = i.Company_ID, salary = i.salary)] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

Second, it's inefficient to do this join twice (once for := and a second time for which), but I can't see any way around that... maybe justifying a feature request to allow both j and which?

Frank
  • 66,179
  • 8
  • 96
  • 180
  • Thanks, Frank, it works as you described. Do I assume correctly, then, that you have to explicitly specify all columns with the `.(...)` syntax in `J`? If I read Arun's comment on the issue you linked correctly, that is by design, as non-equi-joins often return a lot more rows and thus returning all columns is not suitable by default? – bendae Jan 13 '17 at 16:09
  • @bendae I'm trying to think if there's a workaround... I'll edit if I think of it. In the meantime, please leave your question open / don't accept this answer, since someone else may know a way. Arun's comment was directed at the previous commenter (who was interested in having non-equi join in `merge` instead of `[`). – Frank Jan 13 '17 at 16:13
  • Thank you again! If nothing else comes up over the next few days I'll accept your answer. – bendae Jan 13 '17 at 16:16
4

Now, that #1989 has been fixed with data.table version 1.12.3 (in development) it is possible to pick all columns from a and b without stating each column name explicitely:

a[b, on = .(salary >= LB, salary < UB), 
  mget(c(paste0("x.", names(a)), paste0("i.", names(b))))]
   x.Company_ID x.salary i.cat i.LB i.UB i.rep
1:            1     2000     1    0 3000   Bob
2:            1     3000     2 3000 5000 Alice
3:            1     4000     2 3000 5000 Alice

which returns OP's expected result except for the column headers.

To change the column headers, setnames() from the data.table package can be used:

result <- a[b, on = .(salary >= LB, salary < UB), 
            mget(c(paste0("x.", names(a)), paste0("i.", names(b))))] 
setnames(result, c(names(a), names(b)))
result
   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice

or, with piping and using set_names() from the magrittr package

library(magrittr)
a[b, on = .(salary >= LB, salary < UB), 
  mget(c(paste0("x.", names(a)), paste0("i.", names(b)))) %>% 
    set_names(c(names(a), names(b)))]
   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice

Admittedly, this is still cumbersome.

Uwe
  • 41,420
  • 11
  • 90
  • 134