5

Suppose you have data like

fruits <- data.table(FruitID=c(1,2,3), Fruit=c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID=c(1,2,3,4,5), FruitID=c(1,1,1,2,3), Color=c("Red","Yellow","Green","Yellow","Red"))
tastes <- data.table(TasteID=c(1,2,3), FruitID=c(1,1,3), Taste=c("Sweeet", "Sour", "Sweet"))

setkey(fruits, "FruitID")
setkey(colors, "ColorID")
setkey(tastes, "TasteID")

fruits
   FruitID      Fruit
1:       1      Apple
2:       2     Banana
3:       3 Strawberry

colors
   ColorID FruitID  Color
1:       1       1    Red
2:       2       1 Yellow
3:       3       1  Green
4:       4       2 Yellow
5:       5       3    Red

tastes
   TasteID FruitID  Taste
1:       1       1 Sweeet
2:       2       1   Sour
3:       3       3  Sweet

I typically need to perform left-outer joins on data like this. For instance, "give me all fruits and their colors" requires me to write (and maybe there's a better way?)

setkey(colors, "FruitID")
result <- colors[fruits, allow.cartesian=TRUE]
setkey(colors, "ColorID")

Three lines of code for such a simple and frequent task seemed excessive, so I wrote a method myLeftJoin

myLeftJoin <- function(tbl1, tbl2){
  # Performs a left join using the key in tbl1 (i.e. keeps all rows from tbl1 and only matching rows from tbl2)

  oldkey <- key(tbl2)
  setkeyv(tbl2, key(tbl1))
  result <- tbl2[tbl1, allow.cartesian=TRUE]
  setkeyv(tbl2, oldkey)
  return(result)
}

which I can use like

myLeftJoin(fruits, colors)
   ColorID FruitID  Color      Fruit
1:       1       1    Red      Apple
2:       2       1 Yellow      Apple
3:       3       1  Green      Apple
4:       4       2 Yellow     Banana
5:       5       3    Red Strawberry

How can I extend this method so that I can pass any number of tables to it and get the chained left outer join of all of them? Something like myLeftJoin(tbl1, ...)

For instance, I'd like the result of myleftJoin(fruits, colors, tastes) to be equivalent to

setkey(colors, "FruitID")
setkey(tastes, "FruitID")
result <- tastes[colors[fruits, allow.cartesian=TRUE], allow.cartesian=TRUE]
setkey(tastes, "TasteID")
setkey(colors, "ColorID")

result
   TasteID FruitID  Taste ColorID  Color      Fruit
1:       1       1 Sweeet       1    Red      Apple
2:       2       1   Sour       1    Red      Apple
3:       1       1 Sweeet       2 Yellow      Apple
4:       2       1   Sour       2 Yellow      Apple
5:       1       1 Sweeet       3  Green      Apple
6:       2       1   Sour       3  Green      Apple
7:      NA       2     NA       4 Yellow     Banana
8:       3       3  Sweet       5    Red Strawberry

Perhaps there's an elegant solution using methods in the data.table package that I missed? Thanks

(EDIT: Fixed a mistake in my data)

Ben
  • 20,038
  • 30
  • 112
  • 189

2 Answers2

12

I just committed a new feature in data.table, v1.9.5, with which we can join without setting keys (that is, specify the columns to join by directly, without having to use setkey() first):

With that, this is simply:

require(data.table) # v1.9.5+
fruits[tastes, on="FruitID"][colors, on="FruitID"] # no setkey required
#    FruitID      Fruit TasteID  Taste ColorID  Color
# 1:       1      Apple       1 Sweeet       1    Red
# 2:       1      Apple       2   Sour       1    Red
# 3:       1      Apple       1 Sweeet       2 Yellow
# 4:       1      Apple       2   Sour       2 Yellow
# 5:       1      Apple       1 Sweeet       3  Green
# 6:       1      Apple       2   Sour       3  Green
# 7:       2         NA      NA     NA       4 Yellow
# 8:       3 Strawberry       3  Sweet       5    Red
Arun
  • 116,683
  • 26
  • 284
  • 387
  • Hi Arun. In the extended scenario where we have a list of N data.tables (instead of 3 as in this thread) and the same target (join all on a specified variable), is there a more convenient way than linking N expressions? – JDG Aug 21 '19 at 09:59
  • 2
    @J.G., yes, simply use a for-loop. Assuming 'a', 'b' and 'c' are your tables and joining on col 'x': `ans <- copy(a); for (el in list(b, c)) ans <- ans[el, on="x"]` – Arun Sep 26 '19 at 18:12
7

You could use base R's Reduce to left_join (from dplyr) the list of data.table objects at once given that, you are joining the tables with common column names and willing to avoid setting keys multiple times for data.table objects

library(data.table) # <= v1.9.4
library(dplyr) # left_join

Reduce(function(...) left_join(...), list(fruits,colors,tastes))

# Source: local data table [8 x 6]

#  FruitID      Fruit ColorID  Color TasteID  Taste
#1       1      Apple       1    Red       1 Sweeet
#2       1      Apple       1    Red       2   Sour
#3       1      Apple       2 Yellow       1 Sweeet
#4       1      Apple       2 Yellow       2   Sour
#5       1      Apple       3  Green       1 Sweeet
#6       1      Apple       3  Green       2   Sour
#7       2     Banana       4 Yellow      NA     NA
#8       3 Strawberry       5    Red       3  Sweet

Another option with pure data.table approach as @Frank mentioned (Note, this requires the keys to be set to fruitID for all the data.table objects)

library(data.table) # <= v1.9.4
Reduce(function(x,y) y[x, allow.cartesian=TRUE], list(fruits,colors,tastes))
Veerendra Gadekar
  • 4,452
  • 19
  • 24
  • np, btw you might be on an older `data.table` version as I don't see the need for `allow.cartesian=TRUE` – eddi Jul 17 '15 at 18:46
  • @eddi Yeah, me and the OP both needed `allow.cartesian` (as seen up there) – Frank Jul 17 '15 at 18:50
  • Thanks for the solution. When I run `Reduce(function(x,y) y[x, allow.cartesian=TRUE], list(fruits,colors,tastes))` I only get three rows of output. Should be 8. - oh wait - just read the "provided all keys are set to fruitID" part but that's exactly what I'm trying to avoid. – Ben Jul 17 '15 at 18:58
  • @Ben then what exactly you are trying to do? if you want to have a control over the columns with which you want to join then you could use `by` inside `left_join` like this `Reduce(function(...) left_join(..., by = "FruitID"), list(fruits,colors,tastes))`. btw, both the solutions results exactly as the desired output you mentioned in the question – Veerendra Gadekar Jul 17 '15 at 20:15
  • @VeerendraGadekar While Frank's solution works, it requires multiple `setkey` statements in my code every time I want to use it. I'm looking for a data.table based solution that avoids that problem. (Imaging all the setkey statements if you had 10 different tables that you wanted to left join in 10 different ways.) – Ben Jul 17 '15 at 20:43
  • @Ben the first solution given by me works fine in a single line and is equally efficient as that of pure `data.table` approach. I don't see any reason why not to use it, unless you purely want to stick to `data.table` – Veerendra Gadekar Jul 17 '15 at 20:54