
I have the following data frame:

> df1 <- data.frame("valA" = c(1,1,1,1,2,1,3,3,3), "valB" = c(1,2,3,1,2,3,1,2,3), "Score" = c(100,90,80,100, 60,80,10,20,30))
> df1
  valA valB Score
1    1    1   100
2    1    2    90
3    1    3    80
4    1    1   100
5    2    2    60
6    1    3    80
7    3    1    10
8    3    2    20
9    3    3    30

And I want the duplicated rows (the expected result is):

     valA  valB Score
 1     1     1   100
 2     1     3    80
 3     1     1   100
 4     1     3    80

I know that dplyr::distinct can keep only the unique rows, but I need to know which rows are duplicated, not remove the duplicates from the data frame. I also tried base R's duplicated() function, but it's too slow since my data is large (more than 20 million rows). And I tried:

library(dplyr)
duplicated_df1 <- df1 %>% group_by(valA, valB, Score) %>% filter(n() > 1)

which gives the expected result above, but again, it's too slow and I don't have enough RAM. Can anyone suggest an efficient and fast method to find the duplicated rows?

  • Did you try simply `duplicated(df1)`? – talat Dec 20 '17 at 09:39
  • duplicated would only return the "real" duplicates. – Andre Elrico Dec 20 '17 at 09:41
  • This? `df1[duplicated(df1) | duplicated(df1, fromLast = T), ]` or `df1 %>% filter(duplicated(df1) | duplicated(df1,fromLast = T))` – Roman Dec 20 '17 at 09:43
  • Have a look at [this](https://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts) – Sotos Dec 20 '17 at 09:45
  • @docendodiscimus I tried `df1[duplicated(df1), ]`; it works for small data, but my computer crashes since I have more than 20 million observations, so I need faster code. – kalong Dec 20 '17 at 09:55
  • Then you should probably try data.table: `library(data.table); setDT(df1, key = c("valA", "valB", "Score")); df1[, N := .N, by = key(df1)]; df1[N > 1]` – talat Dec 20 '17 at 09:59
  • Hi @Jimbou, thanks for the answer. The code works for a small data set, but when I tried to run it on my data (i.e. 20 million observations), my RStudio crashed. – kalong Dec 20 '17 at 10:00
  • @kalong If you have 20 million observations you should not use dplyr. Removed my solution since it will be extremely slow in that case. – Andre Elrico Dec 20 '17 at 10:01
  • @docendodiscimus Thank you for your answer, it works really fast! – kalong Dec 20 '17 at 10:22

1 Answer


For large-ish data, it's often useful to try a data.table approach. In this case, you can find the duplicated rows using:

library(data.table)
setDT(df1, key = c("valA", "valB", "Score"))
df1[, N := .N, by = key(df1)]                # count rows per group
df1[N > 1]
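
As a quick sanity check, here is a minimal sketch (assuming the example df1 from the question) that reproduces the four expected rows and drops the helper column N afterwards with `:=` and NULL:

library(data.table)

# Example data from the question
df1 <- data.table(valA  = c(1, 1, 1, 1, 2, 1, 3, 3, 3),
                  valB  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                  Score = c(100, 90, 80, 100, 60, 80, 10, 20, 30))

setkey(df1, valA, valB, Score)             # sort/key by all three columns
df1[, N := .N, by = key(df1)]              # count rows per group
dups <- df1[N > 1][, N := NULL]            # keep duplicated groups, drop helper column
dups
#    valA valB Score
# 1:    1    1   100
# 2:    1    1   100
# 3:    1    3    80
# 4:    1    3    80

If you only need each duplicated combination once rather than every occurrence, `unique(dups)` should give one row per group.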
talat