In R, subset a dataframe on rows whose ID appears more than once

Question

Background

I have a dataframe d with ~10,000 rows and n columns, one of which is an ID variable. Most IDs appear once, but some appear more than once. Say that it looks like this:

Problem

I'd like a new dataframe d_sub which contains only the rows whose ID appears more than once in d. I'd like to have something that looks like this:

What I've tried

I've tried something like this:

d_sub <- subset(d, duplicated(d$ID))

But that only gets me one entry each for IDs b and d, and I want all of their respective rows:

Any thoughts?

score 2 · Accepted Answer · answered Aug 24 '21 at 17:54

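We need to combine two duplicated calls with |, because duplicated by itself is FALSE for the first occurrence of each 'ID'. Adding fromLast = TRUE flags every occurrence except the last, so the two conditions together cover all rows whose 'ID' appears more than once.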

d_sub <- subset(d, duplicated(ID)|duplicated(ID, fromLast = TRUE))
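To see why both calls are needed, here is a minimal sketch on a small made-up data frame. The values mirror the example used in the dplyr answer in this thread; the intermediate names `dup_fwd` and `dup_bwd` are only for illustration.

```r
# Hypothetical sample data: IDs "b" and "d" each appear twice
d <- data.frame(
  ID     = c("a", "b", "b", "c", "d", "d"),
  gender = c("f", "f", "m", "f", "f", "m"),
  zip    = c(1, NA, 2, 3, NA, 4)
)

# TRUE for every occurrence of an ID after its first one
dup_fwd <- duplicated(d$ID)

# TRUE for every occurrence of an ID before its last one
dup_bwd <- duplicated(d$ID, fromLast = TRUE)

# Keep a row if either flag is TRUE, i.e. its ID occurs more than once
d_sub <- subset(d, dup_fwd | dup_bwd)
d_sub
```

With this data, `dup_fwd` alone would keep only the second "b" and second "d" rows, which is the behaviour described in the question; the combined condition keeps all four.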

akrun

874,273
37
540
662

This works, but I can't quite tell from the documentation on `duplicated` just *why* it works. What exactly is `fromLast` doing? ("duplicated(x, fromLast = TRUE) is equivalent to but faster than rev(duplicated(rev(x)))" --> Greek to me) – logjammin Aug 24 '21 at 17:58

@logjammin Take a simple example: `v1 <- c(1, 1, 2, 3); duplicated(v1); duplicated(v1, fromLast = TRUE)`. With `fromLast = TRUE`, the check runs from the end of the vector, so the last occurrence is the one that is not flagged. When you combine the two with `|`, a position is TRUE if either result is TRUE, and thus all the elements that have duplicates are now TRUE – akrun Aug 24 '21 at 17:59

@logjammin It depends on the function, i.e. `rev` may have some additional checks that prevent it from being faster, whereas if you use a single function, i.e. `duplicated`, all those checks are already done – akrun Aug 24 '21 at 18:00
Oh I see, that's interesting. Side thought: not complaining by any means but it seems odd that this would need to be done with an `OR` condition -- seems like `R` would be able to do this more succinctly. Anyways thanks a ton! – logjammin Aug 24 '21 at 18:08

@logjammin `duplicated` is an old function. You are right that it should be able to have this option inherently to avoid calling `duplicated` twice – akrun Aug 24 '21 at 18:10
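The vector example from the comments can be run step by step. The sketch below also checks the equivalence quoted from the documentation, and ends with a `%in%` variant that is not from this thread; it is offered only as one possible way to call `duplicated` once. The names `fwd`, `bwd`, `bwd_rev`, and `alt` are illustrative.

```r
v1 <- c(1, 1, 2, 3)

# Forward pass: the second 1 is flagged, the first is not
fwd <- duplicated(v1)                   # FALSE  TRUE FALSE FALSE

# Backward pass: the first 1 is flagged, the last is not
bwd <- duplicated(v1, fromLast = TRUE)  #  TRUE FALSE FALSE FALSE

# The documentation's equivalence: reverse, check, reverse back
bwd_rev <- rev(duplicated(rev(v1)))
identical(bwd, bwd_rev)                 # TRUE

# Union of both passes marks every value that occurs more than once
fwd | bwd                               #  TRUE  TRUE FALSE FALSE

# Alternative (an assumption, not from the thread): collect the
# duplicated values once, then test membership
alt <- v1 %in% v1[duplicated(v1)]
identical(alt, fwd | bwd)               # TRUE
```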

score 2 · Answer 2 · answered Aug 24 '21 at 19:20

We could use `add_count`, then filter on `n`:

library(dplyr)
df %>%
    add_count(ID) %>% 
    filter(n!=1) %>%
    select(-n)

Example:

library(dplyr)
df <- tribble(
    ~ID, ~gender, ~zip,
    "a", "f", 1,
    "b", "f", NA,
    "b", "m", 2,
    "c", "f", 3,
    "d", "f", NA,
    "d", "m", 4)

df %>%
    add_count(ID) %>% 
    filter(n!=1) %>%
    select(-n)

Output:

  ID    gender   zip
  <chr> <chr>  <dbl>
1 b     f         NA
2 b     m          2
3 d     f         NA
4 d     m          4
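A grouped filter is another way to express the same idea in dplyr. This is a sketch of an alternative, not the method given above: it uses `group_by()` with `n()` to count rows per ID instead of adding and then dropping a helper column. The name `df_sub` is illustrative.

```r
library(dplyr)

# Same sample data as in the example above
df <- tribble(
    ~ID, ~gender, ~zip,
    "a", "f", 1,
    "b", "f", NA,
    "b", "m", 2,
    "c", "f", 3,
    "d", "f", NA,
    "d", "m", 4)

df_sub <- df %>%
    group_by(ID) %>%     # one group per ID
    filter(n() > 1) %>%  # keep groups with more than one row
    ungroup()            # drop the grouping before further use

df_sub
```

On the sample data this should keep the same four rows (IDs b and d) as the `add_count` version.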
