0

What is the fastest function in r to remove the rows in a dataframe if the same two first column is in another dataframe. For example, if the data frame A is as below(with more information columns):

    NAME    SURENAME
    John    Beer
    Rose    Pitt
    Bob     Kin
    Charile Kind
    Smith   Red
    Brad    Tea
    Kale    Joe
    Ana     Bread
    Lauren  Old
    Mike    Karl 

and B as below:

NAME    SURENAME
Rose    Pitt
Smith   Red
Mike    Karl

I want B to be removed from A to be like:

    NAME    SURENAME
    John    Beer
    Bob     Kin
    Charile Kind
    Brad    Tea
    Kale    Joe
    Ana     Bread
    Lauren  Old

So in my case, A has 2 million rows (and 10 other columns) and B has 200,000 rows (all unique Name and Surnames).

Sean
  • 103
  • 9
  • 2
    `dplyr::anti_join(A, B, by = c("NAME", "SURENAME"))` – Ronak Shah Jan 14 '20 at 08:30
  • Hi Ronak, Thanks. Is that fast enough for very big data? – Sean Jan 14 '20 at 08:32
  • 2
    I have nothing to test this on. Since you have the data, you can try it and let us know if it was fast enough. – Ronak Shah Jan 14 '20 at 08:39
  • Does this answer your question? [Find complement of a data frame (anti - join)](https://stackoverflow.com/questions/28702960/find-complement-of-a-data-frame-anti-join) – Cettt Jan 14 '20 at 08:39
  • @Ronak, Thanks. I meant if there is any alternative function rather anti_join(), I can run a benchmark and let you know about computational time. Thanks anyway. – Sean Jan 14 '20 at 08:43
  • @Sean, from the link from @Cettt above, would recommend benchmarking against `fsetdiff` from the `data.table` package. – caldwellst Jan 14 '20 at 08:45
  • @Cettt, Thanks, Lots of alternative ways have been mentioned there. Appreciate it. – Sean Jan 14 '20 at 08:46
  • @caldwellst, what is the difference between fsetdiff and setdiff? aren't they the same? – Sean Jan 14 '20 at 08:47
  • @Sean the `data.table` package is designed to provide an alternative to structure to the `data.frame` that performs much better on large datasets, so will likely outperform the `dplyr` alternative. See [here](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) for more information on `data.table`. – caldwellst Jan 14 '20 at 08:49

2 Answers2

1

Maybe you can try the code below using setdiff() from dplyr package, but you need to check its speed for large data frame (I am not sure about its performance then)

C <- dplyr::setdiff(A,B)

such that

> C
     NAME SURENAME
1    John     Beer
2     Bob      Kin
3 Charile     Kind
4    Brad      Tea
5    Kale      Joe
6     Ana    Bread
7  Lauren      Old

DATA

A <- structure(list(NAME = c("John", "Rose", "Bob", "Charile", "Smith", 
"Brad", "Kale", "Ana", "Lauren", "Mike"), SURENAME = c("Beer", 
"Pitt", "Kin", "Kind", "Red", "Tea", "Joe", "Bread", "Old", "Karl"
)), class = "data.frame", row.names = c(NA, -10L))

B <- structure(list(NAME = c("Rose", "Smith", "Mike"), SURENAME = c("Pitt", 
"Red", "Karl")), class = "data.frame", row.names = c(NA, -3L))
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
1

Tested a benchmark filtering a data frame of 2 million rows by one with 200,000 rows, as indicated in the original post, where you can clearly see the speed of data.table relative to dplyr. Given the immense time dplyr functions took to run, particularly set_diff, I only ran each once.

rbenchmark::benchmark(
  "dplyr_anti_join" = {
    set.seed(1)
    df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    dplyr::anti_join(df, indices, by = c("a", "b"))
  },
  "dplyr_set_diff" = {
    set.seed(1)
    df <- data.frame(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.frame(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    dplyr::setdiff(df, indices)
  },
  "dt" = {
    set.seed(1)
    library(data.table)
    df <- data.table(a = letters[runif(10000000, min = 1, max = 26)],
                     b = runif(100000000, 1, 200000))

    indices <- data.table(a = letters[runif(200000, min = 1, max = 26)],
                          b = 1:200000)
    fsetdiff(df, indices)
  },
  replications = 1
)

#>              test replications elapsed relative user.self sys.self user.child sys.child
#> 1 dplyr_anti_join            1  637.06   13.165    596.86    11.50         NA        NA
#> 2  dplyr_set_diff            1 9981.93  206.281    320.67     4.66         NA        NA
#> 3              dt            1   48.39    1.000     80.61     8.73         NA        NA
caldwellst
  • 5,719
  • 6
  • 22