
I have a large dataset with over 1.5 million rows and several columns, one of which contains an ID for every observation. I would like to extract the rows that share an ID into a new dataframe, or restructure the original dataframe so that it fits the criteria. I've tried merging but couldn't quite make it work. Checking for duplicates also takes forever.

Does anyone know of an efficient way in terms of time complexity to filter these rows from the dataset?

Original dataframe looks similar to this:

    id  attribute    value
    1   attribute 1  value 1
    1   attribute 2  value 2
    1   attribute 3  value 3
    2   attribute 1  value 1
    2   attribute 2  value 2

The result should ideally look like this:

    id  attribute 1  attribute 2  attribute 3
    1   value 1      value 2      value 3
    2   value 1      value 2      NA
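The long-to-wide reshape shown above can be sketched with `tidyr::pivot_wider()`, as recommended in the comments below (the data frame name `df` here is illustrative):

```r
library(tidyr)

# small sample matching the long format above
df <- data.frame(
  id        = c(1, 1, 1, 2, 2),
  attribute = c("attribute 1", "attribute 2", "attribute 3",
                "attribute 1", "attribute 2"),
  value     = c("value 1", "value 2", "value 3", "value 1", "value 2")
)

# spread 'attribute' into columns; missing combinations become NA
wide <- pivot_wider(df, names_from = attribute, values_from = value)
```

Each unique value of `attribute` becomes a column, and id 2 gets `NA` for `attribute 3` because that row is absent from the long data.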
luism
  • Can you provide a minimum reproducible example? It's likely quite simple with a combination of `group_by()`, `filter()` and `n()` – Harrison Jones Oct 07 '22 at 14:54
  • `library(dplyr); df %>% add_count(ID) %>% filter(n > 1)` or `df %>% group_by(ID) %>% filter(n() > 1)` – Jon Spring Oct 07 '22 at 16:38
  • If you want something more substantial, please make this question *reproducible*. This includes sample code you've attempted (including listing non-base R packages, and any errors/warnings received), sample *unambiguous* data (e.g., `data.frame(x=...,y=...)` or the output from `dput(head(x))`), and intended output given that input. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Oct 07 '22 at 16:41
  • @HarrisonJones Sure, sorry for the late response - I've edited the question to make it more reproducible. Does this work for you? – luism Oct 10 '22 at 07:02
  • @r2evans Sorry, I'm new to stackoverflow and R. I've edited the question to make it more reproducible – luism Oct 10 '22 at 07:04
  • Your question should be answered by this one: https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format. I'd recommend the third answer with `pivot_wider()` in the `tidyr` package. – Harrison Jones Oct 11 '22 at 14:17
  • Yes thanks! Just learned about the different formats – luism Oct 12 '22 at 15:03
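Jon Spring's suggestion in the comments can be sketched as follows, assuming the data frame is named `df` and the ID column is named `id` (both names are illustrative):

```r
library(dplyr)

df <- data.frame(id    = c(1, 1, 2, 3, 3, 3),
                 value = c("a", "b", "c", "d", "e", "f"))

# keep only rows whose id occurs more than once
dup_rows <- df %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  ungroup()
```

`add_count(id) %>% filter(n > 1)` is equivalent but leaves an extra `n` column unless you drop it afterwards.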

1 Answer


Guessing without reproducibility.

  • if you need to extract just the second and subsequent instances of duplicates, try

    dupedata <- subset(yourdata, duplicated(ID))
    
  • if you need all rows where there are any duplicates, then either of

    ## option 1
    dupes <- ave(rep(0L, nrow(yourdata)), yourdata$ID, FUN = length) > 1
    dupedata <- yourdata[dupes,]
    ## option 2
    tab <- table(yourdata$ID)
    dupes <- names(tab)[tab > 1]
    dupedata <- subset(yourdata, ID %in% dupes)
    
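As a quick check on a toy data frame (the names here are illustrative), both options keep every row of any ID that appears more than once:

```r
yourdata <- data.frame(ID    = c(1, 1, 2, 3, 3, 3),
                       value = letters[1:6])

## option 1: count rows per ID with ave(), keep groups larger than 1
dupes <- ave(rep(0L, nrow(yourdata)), yourdata$ID, FUN = length) > 1
dupedata <- yourdata[dupes, ]

## option 2: tabulate IDs, keep those seen more than once
tab <- table(yourdata$ID)
dupedata2 <- subset(yourdata, ID %in% names(tab)[tab > 1])

# both keep the rows for IDs 1 and 3 and drop the singleton ID 2
identical(dupedata$ID, dupedata2$ID)
```

By contrast, `subset(yourdata, duplicated(ID))` from the first bullet would keep only the second and later rows per ID (here: one row of ID 1 and two rows of ID 3).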
r2evans