9

I'm still working through the lessons on DataCamp for R, so please forgive me if this question seems naïve.

Consider the following (very contrived) sample:

library(dplyr)
library(tibble)

type <- c("Dog", "Cat", "Cat", "Cat")
name <- c("Ella", "Arrow", "Gabby", "Eddie")
pets = tibble(name, type)

name <- c("Ella", "Arrow", "Dog")
type <- c("Dog", "Cat", "Calvin")
favorites = tibble(name, type)

anti_join(favorites, pets, by = "name")
setdiff(favorites, pets, by = "name")

Both of these return exactly the same data:

> anti_join(favorites, pets, by = "name")
# A tibble: 1 × 2
   name   type
  <chr>  <chr>
1   Dog Calvin

> setdiff(favorites, pets, by = "name")
# A tibble: 1 × 2
   name   type
  <chr>  <chr>
1   Dog Calvin

The documentation for each of them seems to indicate only a subtle difference: that setdiff returns rows, but anti_join does not. From my testing, this doesn't appear to be the case.

Can someone explain to me the true differences between these two, and perhaps provide a better example that illustrates the differences more clearly? (This is an area where DataCamp hasn't been particularly helpful.)

alistaire
  • 42,459
  • 4
  • 77
  • 117
Mike Hofer
  • 16,477
  • 11
  • 74
  • 110
  • `dplyr::setdiff` honors a `by` argument/key? – lukeA Oct 20 '17 at 17:29
  • @lukeA I copied that code straight out of RStudio after executing it. So if that call to `setdiff` was resolved out of the `dplyr` package, I'd assume so. According to the `dplyr` docs, the signature for `setdiff` is `setdiff(x, y, ...)`, so it takes arguments that are passed on to other calls. Maybe I just got lucky. :) – Mike Hofer Oct 20 '17 at 17:42

2 Answers2

10

Both subset the first parameter, but setdiff requires the columns to be the same:

library(dplyr)

setdiff(mtcars, mtcars[1:30, ])
#>    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> 1 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
#> 2 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

setdiff(mtcars, mtcars[1:30, 1:6])
#> Error in setdiff_data_frame(x, y): not compatible: Cols in x but not y: `carb`, `gear`, `am`, `vs`, `qsec`.

whereas anti_join is a join, so doesn't:

anti_join(mtcars, mtcars[1:30, 1:3])
#> Joining, by = c("mpg", "cyl", "disp")
#>    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> 1 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
#> 2 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • 1
    So the differences are in *how* they compare the data? setdiff does it by comparing the entire row, whereas anti_join only compares the keys? – Mike Hofer Oct 20 '17 at 17:23
  • 3
    Right. `dplyr::setdiff` works similarly to `base::setdiff(1:10, 1:8)`, but takes the rows as the values, meaning that each has to be identical to be discarded. `anti_join` just tries to match the specified/matching columns. – alistaire Oct 20 '17 at 17:26
  • So, is there a scenario where it'd ever be optimal to use `setdiff` over `anti_join`? – Ryan Ward Valverde Feb 25 '23 at 22:34
  • Sure, if you want stronger guarantees that you're comparing identical rows. They'll be translated into SQL by dbplyr differently as well, and `setdiff()` may be more performant as it's potentially doing a lot less work. – alistaire Feb 27 '23 at 19:10
4

One difference between setdiff and anti_join is that you can select columns in anti_join. For example:

df1 <- data.frame(a = c(1,3,5,4), b = c(5,6,7,8))
df2 <- data.frame( a = c(1,2,3,4), b = c(5,9,7,8))
#df1 looks like        df2 look like
# a   b                a   b    
# 1   5                1   5
# 3   6                2   9 
# 5   7                3   7
# 4   8                4   8

set_diff(x,y)

#The first and last rows of df1 and df2 are identical. 
#So set_diff(df1,df2) will return the 2nd and 3rd row of df1

#a b
#3 6
#5 7

#If I do the same thing with anti_join, I get the same results
anti_join(df1,df2)

#a b
#3 6
#5 7
#However,...if I only care about values in df1 column 'b' that are different from the 
# corresponding value in column b of df2... I can use the option "by" parameter..

anti_join(df1,df2, by = 'b')

 #Since column the only number in column b of df1 that is different 
#from the corresponding value in df2 is row two, 
#this returns row 2 of df1

#a b
#3 6

Another difference is that in set_diff, the two dataframes must have the same columns.

#Keeping df1 identical to df1 in the previous example... 
# and df2 the same but with an additional column

df1 <- data.frame(a = c(1,3,5,4), b = c(5,6,7,8))
df2 <- data.frame(a = c(1,2,3,4), b = c(5,9,7,8), l = c(9,9,9,9))

#df1 looks like        df2 look like
# a   b                a   b  c   
# 1   5                1   5  9
# 3   6                2   9  9 
# 5   7                3   7  9
# 4   8                4   8  9

setdiff(df1,df2)
#Returns:
# Error in setdiff_data_frame(x, y) : 
# not compatible: Cols in y but not x: `l`. 

anti_join(df1,df2)
#Ignores column 3 of df2, since there is no corresponding column in df1.  
#Returns: rows in df1 in which (a,b) are not equal to (a,b) in df2 
#(which will be identical to the output when df2 didn't have 
#a third column).

# a b
# 3 6
# 5 7

anti_join(df1,df2, by = 'b')

#Since column the only number in column b of df1 that is different
# from the corresponding value in df2 is row two, this returns row 2 of   df1...
#...same as when df2 only had two columns
Dave Rosenman
  • 1,252
  • 9
  • 13