
I have a large data frame in R (1.3 million rows, 51 columns). I am not sure if there are any duplicate rows, but I want to find out. I tried using the `duplicated()` function, but it took too long and ended up freezing my RStudio. I don't need to know which entries are duplicates; I just want to delete the ones that are.

Does anyone know how to do this without it taking 20+ minutes and eventually failing to finish?

Thanks

Phil
user4999605
  • Does this answer your question? [Remove duplicated rows](https://stackoverflow.com/questions/13967063/remove-duplicated-rows) – Raman Mishra Oct 16 '20 at 13:43

2 Answers


I don't know how you used the `duplicated` function. This way should be relatively quick even if the data frame is large (I've tested it on a data frame with 1.4M rows and 32 columns: it took less than 2 minutes). Note that `df[-which(duplicated(df)), ]` returns zero rows when there are *no* duplicates (because `-integer(0)` selects nothing), so the negation form is safer:

df[!duplicated(df), ]
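If base R is still too slow at 1.3M rows, the data.table package's `unique()` method is usually much faster; a minimal sketch on a toy data frame, assuming data.table is installed:

```r
library(data.table)

# Toy data frame standing in for the 1.3M-row data (same idea at scale):
# the first two rows are identical across all columns
df <- data.frame(a = c(1, 1, 2, 3), b = c("x", "x", "y", "z"))

dt <- as.data.table(df)
dedup <- unique(dt)   # fast removal of fully duplicated rows
nrow(dedup)           # 3 rows remain
```

`unique()` on a data.table deduplicates across all columns by default; pass `by = ` a character vector of column names to deduplicate on a subset.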
Chris Ruehlemann

The first line extracts all rows whose value occurs more than once (pairs, triples, and so on). The second removes every row whose value occurs more than once. Note that both group by a single column `col`, so they check duplicates on that column only, not on the whole row.

duplication <- df %>% group_by(col) %>% filter(any(row_number() > 1))
unique_df <- df %>% group_by(col) %>% filter(!any(row_number() > 1))
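On a toy data frame, the two filters split the data like this (a sketch, assuming dplyr is loaded and `col` is the column being checked):

```r
library(dplyr)

# col = 2 occurs twice; 1 and 3 occur once each
df <- data.frame(col = c(1, 2, 2, 3))

# groups with more than one row (both copies of col = 2)
duplication <- df %>% group_by(col) %>% filter(any(row_number() > 1))

# groups with exactly one row (col = 1 and col = 3)
unique_df <- df %>% group_by(col) %>% filter(!any(row_number() > 1))

nrow(duplication)  # 2
nrow(unique_df)    # 2
```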

You can do the same in base R:

dup <- df[duplicated(df$col) | duplicated(df$col, fromLast = TRUE), ]
uni_df <- df[!(duplicated(df$col) | duplicated(df$col, fromLast = TRUE)), ]
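The base-R pattern works because `duplicated()` marks every occurrence *after* the first, and `fromLast = TRUE` marks every occurrence *before* the last, so their union flags all copies of a repeated value; a small worked example:

```r
# col = 2 occurs twice; 1 and 3 occur once each
df <- data.frame(col = c(1, 2, 2, 3), other = c("a", "b", "c", "d"))

# all rows whose col value occurs more than once (both copies of 2)
dup <- df[duplicated(df$col) | duplicated(df$col, fromLast = TRUE), ]

# only rows whose col value occurs exactly once (1 and 3)
uni <- df[!(duplicated(df$col) | duplicated(df$col, fromLast = TRUE)), ]

nrow(dup)  # 2
nrow(uni)  # 2
```

Without the parentheses around the whole `|` expression, `!` would apply only to the first `duplicated()` call and the filter would keep the wrong rows.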


**If you want to check for duplicates across the whole data frame (all columns), you can use this:**

df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
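This grouped count reports each fully duplicated row once, with `n` giving how many times it occurs; and if the goal is simply to drop the duplicates, `dplyr::distinct()` does that in one call. A sketch, assuming dplyr is loaded:

```r
library(dplyr)

# the (1, "x") row occurs twice
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

dupes <- df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)        # one group: a = 1, b = "x", n = 2

deduped <- distinct(df)  # keeps one copy of each row: 2 rows
```

Note that `group_by_all()` is superseded in recent dplyr releases in favor of `group_by(across(everything()))`, though it still works.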

Kian