
I have a large data frame in R (1.3 million rows, 51 columns). I am not sure if there are any duplicate rows, but I want to find out. I tried using the `duplicated()` function, but it took too long and ended up freezing my RStudio. I don't need to know which entries are duplicates; I just want to delete the ones that are.

Does anyone know how to do this without it taking 20+ minutes and eventually failing to finish?

Thanks

Phil
user4999605
  • Does this answer your question? [Remove duplicated rows](https://stackoverflow.com/questions/13967063/remove-duplicated-rows) – Raman Mishra Oct 16 '20 at 13:43

2 Answers


I don't know how you used the `duplicated` function. This way should be relatively quick even if the data frame is large (I've tested it on a data frame with 1.4M rows and 32 columns: it took less than 2 minutes). Note that `df[-which(duplicated(df)), ]` returns zero rows when there are *no* duplicates (because `-integer(0)` selects nothing), so the negation form is safer:

df[!duplicated(df), ]
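If base R is still too slow at 1.3M rows, the data.table package's `unique()` method is usually much faster; a minimal sketch on a toy data frame, assuming data.table is installed:

```r
library(data.table)

# Toy data frame standing in for the 1.3M-row data (same idea at scale):
# the first two rows are identical across all columns
df <- data.frame(a = c(1, 1, 2, 3), b = c("x", "x", "y", "z"))

dt <- as.data.table(df)
dedup <- unique(dt)   # fast removal of fully duplicated rows
nrow(dedup)           # 3 rows remain
```

`unique()` on a data.table deduplicates across all columns by default; pass `by = ` a character vector of column names to deduplicate on a subset.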
Chris Ruehlemann

The first line extracts all rows whose value occurs more than once (pairs, triples, and so on). The second removes every row whose value occurs more than once. Note that both group by a single column `col`, so they check duplicates on that column only, not on the whole row.

duplication <- df %>% group_by(col) %>% filter(any(row_number() > 1))
unique_df <- df %>% group_by(col) %>% filter(!any(row_number() > 1))
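On a toy data frame, the two filters split the data like this (a sketch, assuming dplyr is loaded and `col` is the column being checked):

```r
library(dplyr)

# col = 2 occurs twice; 1 and 3 occur once each
df <- data.frame(col = c(1, 2, 2, 3))

# groups with more than one row (both copies of col = 2)
duplication <- df %>% group_by(col) %>% filter(any(row_number() > 1))

# groups with exactly one row (col = 1 and col = 3)
unique_df <- df %>% group_by(col) %>% filter(!any(row_number() > 1))

nrow(duplication)  # 2
nrow(unique_df)    # 2
```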

You can do the same in base R:

dup <- df[duplicated(df$col) | duplicated(df$col, fromLast = TRUE), ]
uni_df <- df[!(duplicated(df$col) | duplicated(df$col, fromLast = TRUE)), ]
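The base-R pattern works because `duplicated()` marks every occurrence *after* the first, and `fromLast = TRUE` marks every occurrence *before* the last, so their union flags all copies of a repeated value; a small worked example:

```r
# col = 2 occurs twice; 1 and 3 occur once each
df <- data.frame(col = c(1, 2, 2, 3), other = c("a", "b", "c", "d"))

# all rows whose col value occurs more than once (both copies of 2)
dup <- df[duplicated(df$col) | duplicated(df$col, fromLast = TRUE), ]

# only rows whose col value occurs exactly once (1 and 3)
uni <- df[!(duplicated(df$col) | duplicated(df$col, fromLast = TRUE)), ]

nrow(dup)  # 2
nrow(uni)  # 2
```

Without the parentheses around the whole `|` expression, `!` would apply only to the first `duplicated()` call and the filter would keep the wrong rows.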


**If you want to check for duplicates across the whole data frame (all columns), you can use this:**

df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
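This grouped count reports each fully duplicated row once, with `n` giving how many times it occurs; and if the goal is simply to drop the duplicates, `dplyr::distinct()` does that in one call. A sketch, assuming dplyr is loaded:

```r
library(dplyr)

# the (1, "x") row occurs twice
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

dupes <- df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)        # one group: a = 1, b = "x", n = 2

deduped <- distinct(df)  # keeps one copy of each row: 2 rows
```

Note that `group_by_all()` is superseded in recent dplyr releases in favor of `group_by(across(everything()))`, though it still works.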

Kian