
Background

I have a dataframe d with ~10,000 rows and n columns, one of which is an ID variable. Most IDs appear once, but some appear more than once. Say that it looks like this:

d

Problem

I'd like a new dataframe d_sub that contains only the rows whose ID appears more than once in d. I'd like to have something that looks like this:

d_sub

What I've tried

I've tried something like this:

d_sub <- subset(d, duplicated(d$ID))

But that only gets me one entry each for IDs b and d, and I want all of their respective rows:

failed_attempt

Any thoughts?

logjammin

2 Answers


duplicated by itself is FALSE for the first occurrence of each 'ID', so we need to combine it with a second duplicated call using fromLast = TRUE, joined with |:

d_sub <- subset(d, duplicated(ID)|duplicated(ID, fromLast = TRUE))
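To see why the two calls are needed, here is a minimal sketch on a small made-up data frame. The values are an assumption, borrowed from the example in the other answer rather than taken from the asker's real data, and `from_first`, `from_last`, and `keep` are illustrative names.

```r
# Hypothetical sample data (not the asker's real data frame)
d <- data.frame(
  ID     = c("a", "b", "b", "c", "d", "d"),
  gender = c("f", "f", "m", "f", "f", "m"),
  zip    = c(1, NA, 2, 3, NA, 4),
  stringsAsFactors = FALSE
)

# Scanning from the start: every occurrence after the first is TRUE
from_first <- duplicated(d$ID)

# Scanning from the end: every occurrence before the last is TRUE
from_last <- duplicated(d$ID, fromLast = TRUE)

# A row is kept if either scan flags it, i.e. its ID occurs more than once
keep <- from_first | from_last

d_sub <- d[keep, ]
print(d_sub)
```

Each call on its own misses one row per repeated ID (the first or the last occurrence); the `|` fills in the row the other call missed.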
akrun
  • This works, but I can't quite tell from the documentation on `duplicated` just *why* it works. What exactly is `fromLast` doing? ("duplicated(x, fromLast = TRUE) is equivalent to but faster than rev(duplicated(rev(x)))" --> Greek to me) – logjammin Aug 24 '21 at 17:58
  • @logjammin Just take a simple example: `v1 <- c(1, 1, 2, 3); duplicated(v1); duplicated(v1, fromLast = TRUE)`. With `fromLast = TRUE`, the duplicate check runs from the end of the vector, so it flags every occurrence except the last. When you combine the two with `|`, any position where either one is TRUE becomes TRUE, so all the elements that have duplicates are now TRUE. – akrun Aug 24 '21 at 17:59
  • @logjammin It depends on the function: `rev` may have some additional checks that prevent it from being faster, whereas with a single function such as `duplicated` all those checks are already done. – akrun Aug 24 '21 at 18:00
  • Oh I see, that's interesting. Side thought: not complaining by any means but it seems odd that this would need to be done with an `OR` condition -- seems like `R` would be able to do this more succinctly. Anyways thanks a ton! – logjammin Aug 24 '21 at 18:08
  • @logjammin `duplicated` is an old function. You are right that it should be able to have this option inherently to avoid calling `duplicated` twice – akrun Aug 24 '21 at 18:10

We could use add_count, then filter on n:

library(dplyr)
df %>%
    add_count(ID) %>% 
    filter(n!=1) %>%
    select(-n)

Example:

library(dplyr)
df <- tribble(
    ~ID, ~gender, ~zip,
    "a", "f", 1,
    "b", "f", NA,
    "b", "m", 2,
    "c", "f", 3,
    "d", "f", NA,
    "d", "m", 4)

df %>%
    add_count(ID) %>% 
    filter(n!=1) %>%
    select(-n)

Output:

  ID    gender   zip
  <chr> <chr>  <dbl>
1 b     f         NA
2 b     m          2
3 d     f         NA
4 d     m          4
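For anyone who would rather avoid the dplyr dependency, the same count-then-filter idea can be sketched in base R with `table()`. This is an illustration under the same made-up sample data, not part of the answer above; `counts` and `repeated_ids` are illustrative names.

```r
# Hypothetical sample data mirroring the tribble above
d <- data.frame(
  ID     = c("a", "b", "b", "c", "d", "d"),
  gender = c("f", "f", "m", "f", "f", "m"),
  zip    = c(1, NA, 2, 3, NA, 4),
  stringsAsFactors = FALSE
)

# Count how often each ID occurs (table() drops NA IDs by default)
counts <- table(d$ID)

# IDs that occur more than once
repeated_ids <- names(counts[counts > 1])

# Keep every row whose ID is in that set
d_sub <- d[d$ID %in% repeated_ids, ]
print(d_sub)
```

Note that rows with a missing ID are never kept here, because `table()` ignores NA unless `useNA` is set.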
TarJae