How can I perform a "setdiff" merge using data.table?

Question

Say I have two tables:

library(data.table)
set.seed(1)

tab1 <- data.table(
  let = rep(letters[1:2], each = 3),
  num = rep(1:3, 2),
  val = rnorm(6),
  key = c("let", "num")
)

tab2 <- data.table(
  let = rep(letters[1:2], each = 2),
  num = rep(1:2, 2),
  val = rnorm(4),
  key = c("let", "num")
)

Table 1:

> tab1
   let num        val
1:   a   1 -0.6264538
2:   a   2  0.1836433
3:   a   3 -0.8356286
4:   b   1  1.5952808
5:   b   2  0.3295078
6:   b   3 -0.8204684

Table 2:

> tab2
   let num
1:   a   1
2:   a   2
3:   b   1
4:   b   2

Is there a way to "merge" these tables such that I get all the results in tab1 that are not in tab2?:

   let num        val
1:   a   3 -0.8356286
2:   b   3 -0.8204684

Relevant post: [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871) — zx8754, Nov 12 '15 at 09:18

score 18 · Accepted Answer · answered Nov 12 '15 at 08:33

18

In this case, it's equivalent to an anti join:

tab1[!tab2, on=c("let", "num")]

But setdiff() would only the first row for every let,num. This is marked for v1.9.8, FR #547.

answered Nov 12 '15 at 08:33

Arun

116,683
26
284
387

is `on` operating like a `setkey` for the merge? – Colonel Beauvel Nov 12 '15 at 11:50
Given that my examples already had set keys, this can be reduced to `tab1[!tab2]` – sebastian-c Nov 13 '15 at 09:34
@ColonelBeauvel, yes, it does ad-hoc joins. Quite useful when you've large data and don't want to reorder it just to join. Or when order needs to be preserved. Also what columns are being joined on is clear (unlike `setkey` which might happen somewhere else in the code.. And the fact that it's a join operation is also clear. – Arun Nov 13 '15 at 09:40
1

@sebastian-c, I still prefer `on=` so that it's clear at a later point as to what's being joined with what. – Arun Nov 13 '15 at 09:41
Thanks! I started learning with setkey and will consider this option from now! – Colonel Beauvel Nov 13 '15 at 14:02
@ColonelBeauvel, yes, I plan to update the vignettes to reflect this change. `setkey()` is mostly for advanced users (in my mind) and when the task involves repeated operations on the same (reordered) data (when having data sorted in RAM is quite beneficial). – Arun Nov 13 '15 at 14:28
`fsetdiff()` is now implemented - as shown in [this answer](https://stackoverflow.com/a/28703077/5977215) – SymbolixAU Jan 04 '18 at 00:12

score 0 · Answer 2 · edited Apr 24 '20 at 06:45

0

One solution would be to do a merge and remove the rows where there are values from tab2

d<-as.data.frame(merge(tab1,tab2,all=T))
t<-is.na(d[,4])
d[t,][,-4]

 let num      val.x
3   a   3 -0.8356286
6   b   3 -0.8204684

Using data.table :

merge(tab1,tab2,all=T)[is.na(val.y), 1:3]

 let num      val.x
1:   a   3 -0.8356286
2:   b   3 -0.8204684

edited Apr 24 '20 at 06:45

MichaelChirico

33,841
14
113
198

answered Nov 12 '15 at 08:24

etienne

3,648
4
23
37

This works in this case, but it might not work if I have NAs in my data. – sebastian-c Nov 13 '15 at 19:37

How can I perform a "setdiff" merge using data.table?

2 Answers2

Linked

Related