
I have the following data frame:

> df1 <- data.frame("valA" = c(1,1,1,1,2,1,3,3,3), "valB" = c(1,2,3,1,2,3,1,2,3), "Score" = c(100,90,80,100, 60,80,10,20,30))
> df1
  valA valB Score
1    1    1   100
2    1    2    90
3    1    3    80
4    1    1   100
5    2    2    60
6    1    3    80
7    3    1    10
8    3    2    20
9    3    3    30

And I want the duplicated rows (the expected result is):

     valA  valB Score
 1     1     1   100
 2     1     3    80
 3     1     1   100
 4     1     3    80

I know that dplyr::distinct can keep only the unique rows, but I need to know which rows are duplicated, not remove the duplicates from the data frame. I also tried base R's duplicated() function, but it's too slow since my data is large (more than 20 million rows). And I tried:

library(dplyr)
duplicated_df1 <- df1 %>% group_by(valA, valB, Score) %>% filter(n() > 1)

which gives the expected result above, but again, it's too slow and I don't have enough RAM. Can anyone suggest an efficient and fast method to find the duplicated rows?

  • Did you try simply `duplicated(df1)`? – talat Dec 20 '17 at 09:39
  • duplicated would only return the "real" duplicates. – Andre Elrico Dec 20 '17 at 09:41
  • This? `df1[duplicated(df1) | duplicated(df1, fromLast = T), ]` or `df1 %>% filter(duplicated(df1) | duplicated(df1,fromLast = T))` – Roman Dec 20 '17 at 09:43
  • Have a look at [this](https://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts) – Sotos Dec 20 '17 at 09:45
  • @docendodiscimus I tried `df1[duplicated(df1), ]`; it works for small data, but my computer crashes since I have more than 20 million observations, so I need faster code. – kalong Dec 20 '17 at 09:55
  • Then you should probably try data.table: `library(data.table); setDT(df1, key = c("valA", "valB", "Score")); df1[, N := .N, by = key(df1)]; df1[N > 1]` – talat Dec 20 '17 at 09:59
  • Hi @Jimbou, thanks for the answer. The code works for a small data set, but when I tried to run it on my data (i.e. 20 million observations), my RStudio crashed. – kalong Dec 20 '17 at 10:00
  • @kalong If you have 20 million observations you should not use dplyr. Removed my solution since it will be extremely slow in that case. – Andre Elrico Dec 20 '17 at 10:01
  • @docendodiscimus Thank you for your answer, it works really fast! – kalong Dec 20 '17 at 10:22

1 Answer


For large-ish data, it's often useful to try a data.table approach. In this case, you can find the duplicated rows using:

library(data.table)
setDT(df1, key = c("valA", "valB", "Score"))
df1[, N := .N, by = key(df1)]                # count rows per group
df1[N > 1]
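
As a quick sanity check, here is a minimal sketch (assuming the example df1 from the question) that reproduces the four expected rows and drops the helper column N afterwards with `:=` and NULL:

library(data.table)

# Example data from the question
df1 <- data.table(valA  = c(1, 1, 1, 1, 2, 1, 3, 3, 3),
                  valB  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                  Score = c(100, 90, 80, 100, 60, 80, 10, 20, 30))

setkey(df1, valA, valB, Score)             # sort/key by all three columns
df1[, N := .N, by = key(df1)]              # count rows per group
dups <- df1[N > 1][, N := NULL]            # keep duplicated groups, drop helper column
dups
#    valA valB Score
# 1:    1    1   100
# 2:    1    1   100
# 3:    1    3    80
# 4:    1    3    80

If you only need each duplicated combination once rather than every occurrence, `unique(dups)` should give one row per group.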
talat