I have 2 data.frames of unequal dimensions:

 df1 <- data.frame(x = c(1,2,3), 
                   y = c("foo", "bar","wow"))
 df2 <- data.frame(x = c(2,2,5,7),
                   y = c("foo", "bar","wow","new"), 
                   z = c("this","is","weird","huh"))

In df1, each x value is paired with a specific y value.

I need to compare df2 to df1 to verify that variables x and y are correct: every combination of df2$x and df2$y must match a combination that appears in df1.

How can I get the rows of df2 whose values are either not in df1 or do not match the x-y pairing in df1?

So, from these two dataframes:

# df1  
# x   y  
# 1 foo  
# 2 bar  
# 3 wow

# df2 
# x   y     z
# 2 foo  this  
# 2 bar    is  
# 5 wow weird  
# 7 new   huh 

I would like to get a result like:

# badRows
# x   y     z
# 2 foo  this  
# 5 wow weird  
# 7 new   huh  

I've tried using identical() and compare() with no luck.

Update: A provisional answer was found here, although how to use multiple keys was not clear to me until after this question was posted: Find complement of a data frame (anti - join)

Dania

1 Answer


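See ?dplyr::anti_join. anti_join(x, y) returns all rows from x that have no matching values in y, keeping just the columns from x. Here x is df2 and y is df1, and passing both column names to by means a row only counts as a match when x and y agree together: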

library(dplyr)

anti_join(df2, df1, by = c("x", "y"))
#   x   y     z
# 1 2 foo  this
# 2 5 wow weird
# 3 7 new   huh
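
If dplyr is not available, the same idea can be sketched in base R. This is only a rough equivalent, not what anti_join does internally: it pastes x and y into a single key per row and keeps the rows of df2 whose key does not appear in df1. The names key1 and key2 and the "\r" separator are assumptions made for illustration; the separator just needs to be a character that never occurs in the data.

```r
# Recreate the data frames from the question.
df1 <- data.frame(x = c(1, 2, 3),
                  y = c("foo", "bar", "wow"))
df2 <- data.frame(x = c(2, 2, 5, 7),
                  y = c("foo", "bar", "wow", "new"),
                  z = c("this", "is", "weird", "huh"))

# Build one composite key per row from the (x, y) pair.
# "\r" is an arbitrary separator assumed not to occur in the data.
key1 <- paste(df1$x, df1$y, sep = "\r")
key2 <- paste(df2$x, df2$y, sep = "\r")

# Keep the rows of df2 whose (x, y) pair has no match in df1.
badRows <- df2[!(key2 %in% key1), ]
badRows
```

Unlike anti_join, this keeps the original row names of df2, so the three rows come back labelled 1, 3 and 4.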
JasonAizkalns