
I'd like to stick with dplyr if possible (I'm a major fan), though base R is fine too if the solution is simple.

Suppose you have two data frames, as shown below. I'd like to create a new data frame that compares the two (df1 and df2) and keeps only those complete rows from df1 whose ID doesn't appear among the IDs in df2. df2 serves as the "knock-out" list. How would this be done?

df1 <- data.frame(
  ID = c(1,2,3,4,5),
  Value = c(10,20,30,40,50)
)

df2 <- data.frame(
  ID = c(6,7,8,1,2),
  Value = c(60,70,80,10,20)
)

The new data frame, call it df3, would look like this after applying the df2 "knock-outs", when printed in the RStudio console:

  ID Value
1  3    30
2  4    40
3  5    50

IDs 1 and 2 got knocked out because they appear in both df1 and df2.

2 Answers

This can be achieved via an anti_join, which returns all rows of the first data frame that have no match in the second:

dplyr::anti_join(df1, df2, by = "ID")
#>   ID Value
#> 1  3    30
#> 2  4    40
#> 3  5    50
stefan
  • I knew dplyr would have an elegant function for this! And I won't forget a name like "anti_join"! Quick question: does putting "dplyr::" at the beginning help in any way, other than denoting that this is a dplyr function? – Curious Jorge - user9788072 Dec 30 '21 at 12:39
  • It depends. (: Here I simply used it for brevity. Personally I use the `::` notation when I only need one or two functions from a package, when I need a function only once in a script, when I use functions inside my own packages, or when I want to avoid name clashes. In general, using `::` incurs a small performance cost; however, attaching a whole package also involves some one-time overhead, so there is a trade-off. – stefan Jan 02 '22 at 09:51
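To make the trade-off concrete, here is a minimal sketch (assuming dplyr is installed) showing that the `::` call and the attached-package call are interchangeable; the data frames below are just toy examples:

```r
df1 <- data.frame(ID = c(1, 2, 3), Value = c(10, 20, 30))
df2 <- data.frame(ID = c(1, 2), Value = c(10, 20))

# :: calls the function without attaching the whole package:
res1 <- dplyr::anti_join(df1, df2, by = "ID")

# library() attaches dplyr once, after which bare names resolve:
library(dplyr)
res2 <- anti_join(df1, df2, by = "ID")

identical(res1, res2)  # both calls give the same result
```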

As I mentioned in the comments, stefan's answer is best. Here is an alternative in base R, which subsets df1 to the rows whose ID is not in df2$ID:

df1[!(df1$ID %in% df2$ID),]
  ID Value
3  3    30
4  4    40
5  5    50

benchmark: (image of benchmark results from the original answer not reproduced here)
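Since the benchmark image is lost, here is a sketch of how such a comparison could be reproduced, assuming the microbenchmark and dplyr packages are installed (exact timings will vary by machine):

```r
library(microbenchmark)

df1 <- data.frame(ID = c(1, 2, 3, 4, 5), Value = c(10, 20, 30, 40, 50))
df2 <- data.frame(ID = c(6, 7, 8, 1, 2), Value = c(60, 70, 80, 10, 20))

# Time both approaches; each expression is run 100 times.
microbenchmark(
  anti_join = dplyr::anti_join(df1, df2, by = "ID"),
  base_r    = df1[!(df1$ID %in% df2$ID), ],
  times     = 100
)
```

On small data frames like these, the base R subset is typically faster because it avoids dplyr's overhead; the gap narrows on larger data.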

TarJae