Checking pairs of items between two data frames

Question

I have 2 data frames (A and B) of the following structure:

A:

projectID    offerID
   20          12
   20          17 
   32          12
   32          25

B:

 projectID    offerID
   20          12
   20          17 
   32          12

I'd like to find the (projectID, offerID) pairs that are in A but not in B. In my example, the result should be a new data frame containing only those pairs:

projectID    offerID
   32           25

I tried some options; for example:

APairs <- A %>% group_by(projectID, offerID)
BPairs <- B %>% group_by(projectID, offerID)

!(APairs %in% BPairs)

but I get a TRUE/FALSE result that I can't interpret or verify against my data.

Your help will be appreciated!

score 4 · Answer 1 · answered Jan 22 '17 at 10:26


In base R:

# define the key columns, in case A and B have different structures (extra columns)
cols<-c("projectID","offerID")
A[!do.call(paste,A[cols]) %in% do.call(paste,B[cols]),]
#  projectID offerID
#4        32      25
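
A self-contained sketch of the same idea (not part of the original answer), rebuilding `A` and `B` from the question. It also shows why the attempt in the question is hard to read: a data frame is a list of columns, so `%in%` compares whole columns and returns one logical value per column, not one per row. The explicit `sep = "|"` is an assumption added here so that pasted keys stay unambiguous if the key columns ever hold strings containing spaces.

```r
# Rebuild the example data from the question
A <- data.frame(projectID = c(20, 20, 32, 32), offerID = c(12, 17, 12, 25))
B <- data.frame(projectID = c(20, 20, 32),     offerID = c(12, 17, 12))

# %in% on data frames works column by column: one value per column of A
per_column <- A %in% B
length(per_column)   # equals ncol(A), not nrow(A)

# Build one key string per row, then compare the keys row by row
cols <- c("projectID", "offerID")
keyA <- do.call(paste, c(A[cols], sep = "|"))
keyB <- do.call(paste, c(B[cols], sep = "|"))

# Keep the rows of A whose key does not occur in B
only_in_A <- A[!(keyA %in% keyB), ]
only_in_A
```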

answered Jan 22 '17 at 10:26

nicola


joel.wilson · Answer 2 · 2017-01-22T10:44:54.617

3

library(data.table)
setkey(setDT(A))
setkey(setDT(B))
A[!B]                # A[B] is similar to merge() so perform the opposite using !
#   projectID offerID
#1:        32      25

# in case there are extra columns in either table, specify the common columns in a vector
common.col <- c("projectID", "offerID")
setkeyv(setDT(A), cols = common.col)
setkeyv(setDT(B), cols = common.col)
A[!B]

edited Jan 22 '17 at 10:44

answered Jan 22 '17 at 10:14

joel.wilson


It doesn't work for some reason. you think it's because I have more columns in one of the data frames? – staove7 Jan 22 '17 at 10:23
@staove7 i have edited based on your new query to having extra columns! let me know how this performs! – joel.wilson Jan 22 '17 at 10:39
it tells me that x is not a data.table.. :( – staove7 Jan 22 '17 at 10:42
have the `setDT()` around the `A/B` @staove7 – joel.wilson Jan 22 '17 at 10:53

You don't really need to set keys. It is better to use `on`, which does not sort the data. – David Arenburg Jan 22 '17 at 11:12
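
A minimal sketch of the `on` approach suggested in the comment above, assuming the example data from the question. The tables are built directly with `data.table()` here, which is a choice made for this sketch; with existing data frames, `setDT(A)` and `setDT(B)` would be needed first. No key is set, so the row order of `A` is left untouched.

```r
library(data.table)

# Example data from the question, created as data.tables
A <- data.table(projectID = c(20, 20, 32, 32), offerID = c(12, 17, 12, 25))
B <- data.table(projectID = c(20, 20, 32),     offerID = c(12, 17, 12))

# Anti-join on the named columns only; no setkey() call required
cols <- c("projectID", "offerID")
only_in_A <- A[!B, on = cols]
only_in_A
```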

akrun · Accepted Answer · 2017-01-22T10:26:30.897

2

We can use anti_join from dplyr

 library(dplyr)
 anti_join(A, B)
 #    projectID offerID
 #1        32      25

If the data frames have additional columns, specify the `by` argument

 anti_join(A, B, by = c("projectID", "offerID"))
 #    projectID offerID
 #1        32      25
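
 A small sketch of when `by` matters, assuming both data frames carry an extra column with the same name but unrelated values. The `note` column is hypothetical and added only for illustration. Without `by`, the join would use every shared column, `note` included, so no row of A would find a match in B.

```r
library(dplyr)

# Example data from the question plus a hypothetical shared `note` column
A <- data.frame(projectID = c(20, 20, 32, 32),
                offerID   = c(12, 17, 12, 25),
                note      = c("a", "b", "c", "d"))
B <- data.frame(projectID = c(20, 20, 32),
                offerID   = c(12, 17, 12),
                note      = c("x", "y", "z"))

# Match on the pair columns only; all columns of A are kept in the result
only_in_A <- anti_join(A, B, by = c("projectID", "offerID"))
only_in_A
```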

edited Jan 22 '17 at 10:26

answered Jan 22 '17 at 10:16

akrun

