10

Say I have two tables:

library(data.table)
set.seed(1)

tab1 <- data.table(
  let = rep(letters[1:2], each = 3),
  num = rep(1:3, 2),
  val = rnorm(6),
  key = c("let", "num")
)

tab2 <- data.table(
  let = rep(letters[1:2], each = 2),
  num = rep(1:2, 2),
  val = rnorm(4),
  key = c("let", "num")
)

Table 1:

> tab1
   let num        val
1:   a   1 -0.6264538
2:   a   2  0.1836433
3:   a   3 -0.8356286
4:   b   1  1.5952808
5:   b   2  0.3295078
6:   b   3 -0.8204684

Table 2:

> tab2
   let num
1:   a   1
2:   a   2
3:   b   1
4:   b   2

Is there a way to "merge" these tables such that I get all the results in tab1 that are not in tab2?:

   let num        val
1:   a   3 -0.8356286
2:   b   3 -0.8204684
sebastian-c
  • 15,057
  • 3
  • 47
  • 93
  • Relevant post: [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871) – zx8754 Nov 12 '15 at 09:18

2 Answers2

18

In this case, it's equivalent to an anti join:

tab1[!tab2, on=c("let", "num")]

But setdiff() would only the first row for every let,num. This is marked for v1.9.8, FR #547.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • is `on` operating like a `setkey` for the merge? – Colonel Beauvel Nov 12 '15 at 11:50
  • Given that my examples already had set keys, this can be reduced to `tab1[!tab2]` – sebastian-c Nov 13 '15 at 09:34
  • @ColonelBeauvel, yes, it does ad-hoc joins. Quite useful when you've large data and don't want to reorder it just to join. Or when order needs to be preserved. Also what columns are being joined on is clear (unlike `setkey` which might happen somewhere else in the code.. And the fact that it's a join operation is also clear. – Arun Nov 13 '15 at 09:40
  • 1
    @sebastian-c, I still prefer `on=` so that it's clear at a later point as to what's being joined with what. – Arun Nov 13 '15 at 09:41
  • Thanks! I started learning with setkey and will consider this option from now! – Colonel Beauvel Nov 13 '15 at 14:02
  • @ColonelBeauvel, yes, I plan to update the vignettes to reflect this change. `setkey()` is mostly for advanced users (in my mind) and when the task involves repeated operations on the same (reordered) data (when having data sorted in RAM is quite beneficial). – Arun Nov 13 '15 at 14:28
  • `fsetdiff()` is now implemented - as shown in [this answer](https://stackoverflow.com/a/28703077/5977215) – SymbolixAU Jan 04 '18 at 00:12
0

One solution would be to do a merge and remove the rows where there are values from tab2

d<-as.data.frame(merge(tab1,tab2,all=T))
t<-is.na(d[,4])
d[t,][,-4]

 let num      val.x
3   a   3 -0.8356286
6   b   3 -0.8204684

Using data.table :

merge(tab1,tab2,all=T)[is.na(val.y), 1:3]

 let num      val.x
1:   a   3 -0.8356286
2:   b   3 -0.8204684
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
etienne
  • 3,648
  • 4
  • 23
  • 37