0

I have a data frame that looks like this:

id        dob lname
1 1900-01-01     a
2 1900-01-01     b
3 1900-01-01     b
4 1901-01-01     c
5 1901-01-01     d
6 1902-01-01     e
7 1902-01-01     e
8 1902-01-01     f
9 1903-01-01     g
10 1903-01-01     h

I want to filter to show all rows where dob is duplicated AND lname is duplicated, so the desired output would look like this:

id        dob lname
2 1900-01-01     b
3 1900-01-01     b
6 1902-01-01     e
7 1902-01-01     e

I tried grouping by both dob and lname, but I'm stuck on the next step, which would return all rows where those columns have duplicate values.

Here is code for the example:

id <- c(1:10)
dob <- date(c("1900-01-01", "1900-01-01", "1900-01-01", "1901-01-01", "1901-01-01", "1902-01-01", "1902-01-01", "1902-01-01", "1903-01-01", "1903-01-01"))
lname <- c("a", "b", "b", "c", "d", "e", "e", "f", "g", "h")
df <- data.frame("id" = id, "dob" = dob, "lname" = lname)
epi_n00b
  • 150
  • 3
  • 15
  • Technically your problem is the inverse of this, but you can use hadley's answer and just filter for `row_number(z) > 1` – divibisan Sep 25 '18 at 22:28
  • 1
    Related: [Subset data frame based on number of rows per group](https://stackoverflow.com/questions/20204257/subset-data-frame-based-on-number-of-rows-per-group). Adjust to your desired condition for number of rows per group. – Henrik Sep 25 '18 at 22:30
  • @divibisan do you know how you'd do it with `distinct()`, since Hadley noted it was created for this purpose? How do you call for the opposite of `distinct()` ? – epi_n00b Sep 25 '18 at 22:30
  • @divibisan using `row_number() > 1` doesn't quite give me my desired output, because it still eliminates one of the rows in each set of duplicates. Using my example above, it only gives me the rows with id's `3` and `7` . I don't believe this is a duplicate question. – epi_n00b Sep 25 '18 at 22:37
  • @epi_n00b It's definitely borderline. – divibisan Sep 25 '18 at 22:37
  • 1
    You want `n()` not `row_number()` - `df %>% group_by(dob,lname) %>% filter(n() > 1)` - which makes this a duplicate of @Henrik 's suggestion. – thelatemail Sep 25 '18 at 22:54

2 Answers2

1

Does this dplyr solution work for your needs?

library(dplyr)

    df %>%
         semi_join(df %>%
                   group_by(dob, lname) %>%
                   filter(row_number()>1), 
                   by = c("dob", "lname"))
0

Here's a one line solution using base R -

id <- c(1:10)
dob <- as.Date(c("1900-01-01", "1900-01-01", "1900-01-01", "1901-01-01", "1901-01-01", "1902-01-01", "1902-01-01", "1902-01-01", "1903-01-01", "1903-01-01"))
lname <- c("a", "b", "b", "c", "d", "e", "e", "f", "g", "h")
df <- data.frame("id" = id, "dob" = dob, "lname" = lname)

result <- df[duplicated(df[,2:3]) | duplicated(df[,2:3], fromLast = T), ]
result

For using with pipes -

df %>% .[duplicated(.[,2:3]) | duplicated(.[,2:3], fromLast = T), ]

Another dplyr approach -

df %>% filter(., ave(seq_len(nrow(.)), dob, lname, FUN = length) > 1)
Shree
  • 10,835
  • 1
  • 14
  • 36