0

Consider two data tables where the number of key columns differ:

library(data.table)
tmp_dt <- data.table(group1 = letters[1:5], group2 = c(1, 1, 2, 2, 2), a = rnorm(5), key = c("group1", "group2"))
tmp_dt2 <- data.table(group2 = c(1, 2, 3), color = c("r", "g", "b"), key = "group2")

I want to join tmp_dt to tmp_dt2 by group2, however the following fails:

tmp_dt[tmp_dt2]

> tmp_dt[tmp_dt2]
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'group1' is a character column being joined to i.'group2' which is type 'double'. Character columns must join to factor or character columns.

This makes sense since it tries to join the data tables on the first key variable. How do I fix it so that the behaviour is the same as dplyr::inner_join, without incurring overheads in resetting the key on tmp_dt twice?

> inner_join(tmp_dt, tmp_dt2, by = "group2")
  group1 group2          a color
1      a      1  0.2501413     r
2      b      1  0.6182433     r
3      c      2 -0.1726235     g
4      d      2 -2.2239003     g
5      e      2 -1.2636144     g
Frank
  • 66,179
  • 8
  • 96
  • 180
Alex
  • 15,186
  • 15
  • 73
  • 127
  • `...key = c("group2","group1")` – Jean Dec 20 '16 at 02:43
  • yes, but the key (pardon the pun) is that I do not wish for my data table rows to be ordered in that order. – Alex Dec 20 '16 at 02:44
  • 2
    Don't set keys at all. The current idiom is to use the `on=` argument. Check the docs for `[.data.table`. Also maybe: http://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table – Frank Dec 20 '16 at 02:54
  • ah thanks, did not realised I could do that. Can `on = ` be used even in the presence of keys? – Alex Dec 20 '16 at 03:03
  • Yes, it can. In rare cases, I guess there are still efficiency benefits from keying; but I mostly just use it for tables I will be manually joining on many times. – Frank Dec 20 '16 at 08:07
  • or falling back to a more explicit function call, `merge(tmp_dt, tmp_dt2, by = "group2")` works as well. – Aramis7d Dec 20 '16 at 08:10

2 Answers2

1

Using lapply

tmp_dt[,color:=unlist(lapply(.BY, function(x) tmp_dt2[group2==x, color])), by=group2]

As pointed out by Frank in the comments, using on

tmp_dt[tmp_dt2, on="group2"]

tmp_dt2[tmp_dt, on="group2"]

Using on is roughly twice as fast as lapply using .BY. Although the first example returns a sixth row of NA 3 NA b

manotheshark
  • 4,297
  • 17
  • 30
0

You should use this code

tmp_dt2[tmp_dt, on = 'group2']
Gauss.Y
  • 100
  • 6