7

I am looking for a quick way to do 'not join' (i.e. keep rows that didn't merge, or inverse of inner join). The way I've been doing is to use data.table for X and Y, then set key. For example:

require(data.table)

X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7))
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7))
XY <- merge(X,Y,by='category')

> XY
   category val1 val2
1:        B  0.3    2
2:        C  0.8    3
3:        D  0.7    5

But I need the inverse of this, so I have to do:

XY_All <- merge(X,Y,by='category',all=TRUE)
setkey(XY,category)
setkey(XY_All,category)
notXY <- XY_All[!XY]    #data.table not join (finally)

> notXY
   category val1 val2
1:        A  0.2   NA
2:        E   NA    7

I feel like this is quite long winded (especially from data.frame). Am I missing something?

EDIT: I got this after thinking more about not joins

X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7),key = "category")
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7), key = "category")
notXY <- merge(X[!Y],Y[!X],all=TRUE)

But WheresTheAnyKey's answer below is clearer. One last hurdle is the presetting data.table keys, it'd be nice not to have to do that.

EDIT: To clarify, the accepted solution is:

merge(anti_join(X, Y, by = 'category'),anti_join(Y, X, by = 'category'), by = 'category', all = TRUE)
tanvach
  • 389
  • 1
  • 5
  • 12
  • Not in front of an R-console right now, but I believe you should be able to use something similar to ```X[!Y]``` – Mike.Gahan Jun 12 '14 at 18:03

5 Answers5

7
require(dplyr)
rbind_list(anti_join(X, Y), anti_join(Y, X))

EDIT: Since someone asked for some explanation, here's what is happening:

The first anti_join() function returns rows from X that have no matching row in Y with the match determined by what the join is joining by. The second does the reverse. rbind_list() just takes the results of its inputs and makes them into a single tbl with all the observations from each of its inputs, replacing missing variable data with NA.

stanekam
  • 3,906
  • 2
  • 22
  • 34
  • 3
    doesn't work this way. to get the expected results with anti_join, you need `merge(anti_join(X, Y, by = 'category'),anti_join(Y, X, by = 'category'), by = 'category', all = TRUE)` – rawr Jun 12 '14 at 18:05
  • Sorry. One more step :) – stanekam Jun 12 '14 at 18:06
  • Instead of just the code, you should also include an explanation of what you're doing and why it answers the question. – azurefrog Jun 12 '14 at 18:23
  • @azurefrog Here's your explanation :) – stanekam Jun 12 '14 at 18:29
  • That's really nice too. I'm torn – tanvach Jun 12 '14 at 18:41
  • @tanavach time them to see who wins! – stanekam Jun 12 '14 at 18:44
  • 1
    This solution doesn't 1) require conversion to data.table 2) set the key before merging. So hats off to you! (though WheresTheAnyKey's solution might be more convenient if you use data.table exclusively, since anti_join will convert it to data.frame) – tanvach Jun 14 '14 at 20:18
7
setkey(X,category)
setkey(Y,category)

rbind(X[!Y], Y[!X], fill = TRUE)
eddi
  • 49,088
  • 6
  • 104
  • 155
Mike.Gahan
  • 4,565
  • 23
  • 39
4

You can make it more concise like this:

X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7),key = "category")
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7), key = "category")
notXY <- merge(X,Y,all = TRUE)[!merge(X,Y)]
WheresTheAnyKey
  • 848
  • 5
  • 6
  • Almost, I like it! I'd accept this elegant solution if no one else comes up another without presetting key. – tanvach Jun 12 '14 at 18:35
0

Try this. First, merge with "all" set to "TRUE". Then take out all complete cases:

XY_All <- merge(X,Y,by='category',all=TRUE)
notXY  <- XY_All[!complete.cases(XY_All),]
Stu
  • 1,543
  • 3
  • 17
  • 31
  • 1
    This will only work if you don't have any `NA` values in any other columns of the tables you are merging. – MrFlick Jun 12 '14 at 18:01
  • I think that is fine, according to what @tanvach is asking for. – Stu Jun 12 '14 at 18:10
  • Unfortunately MrFlick is right, the real dataset contains a lot of NAs. But I like how this solution works without setting a key, so have an upvote! – tanvach Jun 12 '14 at 18:33
0
require(dplyr)
notXY = merge(X[!X$category %in% Y$category,], Y[!Y$category %in% X$category,],by = 
"category",all = TRUE)

One way to look at an Anti-Join is that you need observations from X not in Y and observations from Y not in X concatenated together. This can be achieved in one step as shown above.

Rohit
  • 1
  • 2