How to deal with join keys of a different length and variable ordering?

Question

Consider two data tables where the number of key columns differ:

library(data.table)
tmp_dt <- data.table(group1 = letters[1:5], group2 = c(1, 1, 2, 2, 2), a = rnorm(5), key = c("group1", "group2"))
tmp_dt2 <- data.table(group2 = c(1, 2, 3), color = c("r", "g", "b"), key = "group2")

I want to join tmp_dt to tmp_dt2 by group2, however the following fails:

tmp_dt[tmp_dt2]

> tmp_dt[tmp_dt2]
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'group1' is a character column being joined to i.'group2' which is type 'double'. Character columns must join to factor or character columns.

This makes sense since it tries to join the data tables on the first key variable. How do I fix it so that the behaviour is the same as dplyr::inner_join, without incurring overheads in resetting the key on tmp_dt twice?

> inner_join(tmp_dt, tmp_dt2, by = "group2")
  group1 group2          a color
1      a      1  0.2501413     r
2      b      1  0.6182433     r
3      c      2 -0.1726235     g
4      d      2 -2.2239003     g
5      e      2 -1.2636144     g

yes, but the key (pardon the pun) is that I do not wish for my data table rows to be ordered in that order. — Alex, Dec 20 '16 at 02:44
Don't set keys at all. The current idiom is to use the `on=` argument. Check the docs for `[.data.table`. Also maybe: http://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table — Frank, Dec 20 '16 at 02:54
ah thanks, did not realised I could do that. Can `on = ` be used even in the presence of keys? — Alex, Dec 20 '16 at 03:03
Yes, it can. In rare cases, I guess there are still efficiency benefits from keying; but I mostly just use it for tables I will be manually joining on many times. — Frank, Dec 20 '16 at 08:07
or falling back to a more explicit function call, `merge(tmp_dt, tmp_dt2, by = "group2")` works as well. — Aramis7d, Dec 20 '16 at 08:10

manotheshark · Accepted Answer · 2016-12-20T13:32:58.710

1

Using lapply

tmp_dt[,color:=unlist(lapply(.BY, function(x) tmp_dt2[group2==x, color])), by=group2]

As pointed out by Frank in the comments, using on

tmp_dt[tmp_dt2, on="group2"]

tmp_dt2[tmp_dt, on="group2"]

Using on is roughly twice as fast as lapply using .BY. Although the first example returns a sixth row of NA 3 NA b

edited Dec 20 '16 at 13:32

answered Dec 20 '16 at 04:15

manotheshark

4,297
17
30

score 0 · Answer 2 · answered Dec 20 '16 at 07:42

0

You should use this code

tmp_dt2[tmp_dt, on = 'group2']

answered Dec 20 '16 at 07:42

Gauss.Y

100
6

How to deal with join keys of a different length and variable ordering?

2 Answers2