4

Let me first start by saying that this is the first question that I am posting on stack overflow. Please let me know if I need to change the style,format etc in my questions.

I would like to do a left outer join operation on two data tables with an exra condition that I be allowed to have different names for the key variables in the two data tables. Example:

DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"),   x2a=1:6,m1=seq(10,60,by=10))
setkey(DT1,x1,x2a)
> DT1
   x1 x2a m1
1:  a   3 30
2:  a   5 50
3:  b   1 10
4:  b   4 40
5:  b   6 60
6:  c   2 20
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6," "," "),m2=5:10)
setkey(DT2,x1,x2b)
> DT2
   x1 x2b m2
1:  a      9
2:  a     10
3:  b   1  5
4:  b   6  8
5:  c   7  7
6:  d   4  6
############# first, I use the merge operation on the data frames to do a left outer join
dfL<-merge.data.frame(DT1,DT2,by.x=c('x1','x2a'),by.y=c('x1','x2b'),all.x=TRUE)
> dfL
  x1 x2a m1 m2
1  a   3 30 NA
2  a   5 50 NA
3  b   1 10  5
4  b   4 40 NA
5  b   6 60  8
6  c   2 20 NA
################# attempt with data table left outer join 
> dtL<-DT2[DT1,on=c("x1","x2a")]
Error in forderv(x, by = rightcols) : 
  'by' value -2147483648 out of range [1,3]

#################### code that works with data table
DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"), x2=as.character(1:6),m1=seq(10,60,by=10))
setkey(DT1,x1,x2)
DT1
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2=c(1,4,7,6," "," ") ,m2=5:10)
setkey(DT2,x1,x2)
DT2
dtL<-DT2[DT1]
######################## this required identical naming of the key variables in the two data tables
################### Also does not allow a ad-hoc selection of the key variables with the "on" argument

I would like to know if it is possible to retain the flexibility of the merge command with data frames. With data.table.

Vivek
  • 107
  • 1
  • 7
  • 1
    with `data.table` loaded you can use the `merge()` syntax and specify the join columns without the need to set the keys. see `?data.table::merge` – tospig Jan 06 '16 at 23:25
  • 1
    i.e., `library(data.table); merge(DT1, DT2, by.x=c('x1','x2a'), by.y=c('x1','x2b'), all.x=TRUE)` – tospig Jan 06 '16 at 23:29

2 Answers2

16

As from v1.9.6 (on CRAN Sep 2015) :

X[Y] syntax can now join without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column "y" of DT2 with "x" of DT1. DT1[DT2, on="y"] would join column "y" of both data.tables.

Please search the README for the string "on=" for further items relating to on= in v1.9.7 in development. You asked for left outer join, which is the default with X[Y] syntax (X[Y,nomatch=0] switches to inner join).

But note that setting a key is always going to faster to subsequently join to if you can pay the upfront cost of setkey (which is pretty fast in data.table). The concept is similar to a clustered index in SQL.

The advantage of X[Y] syntax vs merge() is that you can include j and by inside the same [...] query. Such a query is optimized so only the columns needed by the j expression are joined, saving time and RAM. See FAQ 1.12. by=.EACHI is also a powerful feature of X[Y] syntax you don't get with merge() and currently unique to data.table. You can use on= together with by=.EACHI inside a DT[...] query.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Is there an html version of the FAQ, or is it only available as a .pdf ([as linked here in **support** > **Links**](https://github.com/Rdatatable/data.table/wiki/Support))? – tospig Jan 07 '16 at 02:07
  • See FAQ 6.10 :-) It's also on the [CRAN page](https://cran.r-project.org/web/packages/data.table/index.html). – Matt Dowle Jan 07 '16 at 02:53
  • haha - fair enough. I'm still getting my head around FAQ 1.12 so haven't got that far yet :) – tospig Jan 07 '16 at 05:27
  • 1
    @Matt Dowle I get an error message with rtL2<-DT2[DT1,on=c('x1',x2a='x2b')] What am I missing? Does this work only on version 1.9.7? I have 1.9.6. – Vivek Jan 08 '16 at 12:45
  • 1
    @Vivek Please ask a new question if it survives the steps in [Support](https://github.com/Rdatatable/data.table/wiki/Support); e.g. follow point 6 to include the error message and point 4 under Links section. – Matt Dowle Jan 08 '16 at 20:03
  • @tospig New [HTML FAQ](https://github.com/Rdatatable/data.table/wiki/FAQ) created on the wiki and intention is to move the PDF FAQ onto there. Since you asked for it, can you check it's ok please and will work for you? It has anchors and anyone can add/edit easily. – Matt Dowle Jan 22 '16 at 23:30
2

From ?data.table::merge

This merge method for data.table behaves very similarly to that of data.frames with one major exception: By default, the columns used to merge the data.tables are the shared key columns rather than the shared columns with the same names. Set the by, or by.x, by.y arguments explicitly to override this default.

So we can use the by arguments to override the keys.

library(data.table)

DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"),   x2a=1:6,m1=seq(10,60,by=10))
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6," "," "),m2=5:10)

## you will get an error when joining a character to a integer:
DT2$x2b <- as.integer(DT2$x2b)
## Alternative:
## DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6,NA,NA),m2=5:10)

merge(DT1, DT2, by.x=c('x1','x2a'), by.y=c('x1','x2b'), all.x=TRUE)

   x1 x2a m1 m2
1:  a   3 30 NA
2:  a   5 50 NA
3:  b   1 10  5
4:  b   4 40 NA
5:  b   6 60  8
6:  c   2 20 NA
tospig
  • 7,762
  • 14
  • 40
  • 79