The fastest way to remove rows from a very big dataframe matching another dataframe

Question

What is the fastest function in r to remove the rows in a dataframe if the same two first column is in another dataframe. For example, if the data frame A is as below(with more information columns):

    NAME    SURENAME
    John    Beer
    Rose    Pitt
    Bob     Kin
    Charile Kind
    Smith   Red
    Brad    Tea
    Kale    Joe
    Ana     Bread
    Lauren  Old
    Mike    Karl

and B as below:

NAME    SURENAME
Rose    Pitt
Smith   Red
Mike    Karl

I want B to be removed from A to be like:

    NAME    SURENAME
    John    Beer
    Bob     Kin
    Charile Kind
    Brad    Tea
    Kale    Joe
    Ana     Bread
    Lauren  Old

So in my case, A has 2 million rows (and 10 other columns) and B has 200,000 rows (all unique Name and Surnames).

I have nothing to test this on. Since you have the data, you can try it and let us know if it was fast enough. — Ronak Shah, Jan 14 '20 at 08:39
Does this answer your question? [Find complement of a data frame (anti - join)](https://stackoverflow.com/questions/28702960/find-complement-of-a-data-frame-anti-join) — Cettt, Jan 14 '20 at 08:39
@Ronak, Thanks. I meant if there is any alternative function rather anti_join(), I can run a benchmark and let you know about computational time. Thanks anyway. — Sean, Jan 14 '20 at 08:43
@Sean, from the link from @Cettt above, would recommend benchmarking against `fsetdiff` from the `data.table` package. — caldwellst, Jan 14 '20 at 08:45
@Cettt, Thanks, Lots of alternative ways have been mentioned there. Appreciate it. — Sean, Jan 14 '20 at 08:46
@caldwellst, what is the difference between fsetdiff and setdiff? aren't they the same? — Sean, Jan 14 '20 at 08:47
@Sean the `data.table` package is designed to provide an alternative to structure to the `data.frame` that performs much better on large datasets, so will likely outperform the `dplyr` alternative. See [here](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) for more information on `data.table`. — caldwellst, Jan 14 '20 at 08:49

score 1 · Answer 1 · answered Jan 14 '20 at 08:41

1

Maybe you can try the code below using setdiff() from dplyr package, but you need to check its speed for large data frame (I am not sure about its performance then)

C <- dplyr::setdiff(A,B)

such that

> C
     NAME SURENAME
1    John     Beer
2     Bob      Kin
3 Charile     Kind
4    Brad      Tea
5    Kale      Joe
6     Ana    Bread
7  Lauren      Old

DATA

A <- structure(list(NAME = c("John", "Rose", "Bob", "Charile", "Smith", 
"Brad", "Kale", "Ana", "Lauren", "Mike"), SURENAME = c("Beer", 
"Pitt", "Kin", "Kind", "Red", "Tea", "Joe", "Bread", "Old", "Karl"
)), class = "data.frame", row.names = c(NA, -10L))

B <- structure(list(NAME = c("Rose", "Smith", "Mike"), SURENAME = c("Pitt", 
"Red", "Karl")), class = "data.frame", row.names = c(NA, -3L))

answered Jan 14 '20 at 08:41

ThomasIsCoding

96,636
9
24
81

Thanks, Thomas, Will run a benchmark between setdiff() and anti_join() and update you. – Sean Jan 14 '20 at 08:44
Posted benchmark here. I was shocked by how bad `dplyr::setdiff` did!! – caldwellst Jan 14 '20 at 19:43
@caldwellst haha, then at least now you know `setdiff` is not a good idea – ThomasIsCoding Jan 14 '20 at 19:45

score 1 · Accepted Answer · answered Jan 14 '20 at 19:31

Tested a benchmark filtering a data frame of 2 million rows by one with 200,000 rows, as indicated in the original post, where you can clearly see the speed of data.table relative to dplyr. Given the immense time dplyr functions took to run, particularly set_diff, I only ran each once.

rbenchmark::benchmark(
  "dplyr_anti_join" = {
    set.seed(1)
    df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    dplyr::anti_join(df, indices, by = c("a", "b"))
  },
  "dplyr_set_diff" = {
    set.seed(1)
    df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    dplyr::setdiff(df, indices)
  },
  "dt" = {
    set.seed(1)
    library(data.table)
    df <- data.table(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.table(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    fsetdiff(df, indices)
  },
  replications = 1
)

#>              test replications elapsed relative user.self sys.self user.child sys.child
#> 1 dplyr_anti_join            1  637.06   13.165    596.86    11.50         NA        NA
#> 2  dplyr_set_diff            1 9981.93  206.281    320.67     4.66         NA        NA
#> 3              dt            1   48.39    1.000     80.61     8.73         NA        NA

The fastest way to remove rows from a very big dataframe matching another dataframe

2 Answers2