Filter to all rows where there are duplicate values in two columns (dplyr)

Question

I have a data frame that looks like this:

id        dob lname
1 1900-01-01     a
2 1900-01-01     b
3 1900-01-01     b
4 1901-01-01     c
5 1901-01-01     d
6 1902-01-01     e
7 1902-01-01     e
8 1902-01-01     f
9 1903-01-01     g
10 1903-01-01     h

I want to filter to show all rows where dob is duplicated AND lname is duplicated, so the desired output would look like this:

id        dob lname
2 1900-01-01     b
3 1900-01-01     b
6 1902-01-01     e
7 1902-01-01     e

I tried grouping by both dob and lname, but I'm stuck on the next step, which would return all rows where those columns have duplicate values.

Here is code for the example:

id <- c(1:10)
dob <- date(c("1900-01-01", "1900-01-01", "1900-01-01", "1901-01-01", "1901-01-01", "1902-01-01", "1902-01-01", "1902-01-01", "1903-01-01", "1903-01-01"))
lname <- c("a", "b", "b", "c", "d", "e", "e", "f", "g", "h")
df <- data.frame("id" = id, "dob" = dob, "lname" = lname)

Technically your problem is the inverse of this, but you can use hadley's answer and just filter for `row_number(z) > 1` — divibisan, Sep 25 '18 at 22:28
Related: [Subset data frame based on number of rows per group](https://stackoverflow.com/questions/20204257/subset-data-frame-based-on-number-of-rows-per-group). Adjust to your desired condition for number of rows per group. — Henrik, Sep 25 '18 at 22:30
@divibisan do you know how you'd do it with `distinct()`, since Hadley noted it was created for this purpose? How do you call for the opposite of `distinct()` ? — epi_n00b, Sep 25 '18 at 22:30
@divibisan using `row_number() > 1` doesn't quite give me my desired output, because it still eliminates one of the rows in each set of duplicates. Using my example above, it only gives me the rows with id's `3` and `7` . I don't believe this is a duplicate question. — epi_n00b, Sep 25 '18 at 22:37
You want `n()` not `row_number()` - `df %>% group_by(dob,lname) %>% filter(n() > 1)` - which makes this a duplicate of @Henrik 's suggestion. — thelatemail, Sep 25 '18 at 22:54

score 1 · Answer 1 · answered Sep 25 '18 at 23:22

1

Does this dplyr solution work for your needs?

library(dplyr)

    df %>%
         semi_join(df %>%
                   group_by(dob, lname) %>%
                   filter(row_number()>1), 
                   by = c("dob", "lname"))

answered Sep 25 '18 at 23:22

Christian Million

662
5
10

Shree · Accepted Answer · 2018-09-25T22:51:20.267

Here's a one line solution using base R -

id <- c(1:10)
dob <- as.Date(c("1900-01-01", "1900-01-01", "1900-01-01", "1901-01-01", "1901-01-01", "1902-01-01", "1902-01-01", "1902-01-01", "1903-01-01", "1903-01-01"))
lname <- c("a", "b", "b", "c", "d", "e", "e", "f", "g", "h")
df <- data.frame("id" = id, "dob" = dob, "lname" = lname)

result <- df[duplicated(df[,2:3]) | duplicated(df[,2:3], fromLast = T), ]
result

For using with pipes -

df %>% .[duplicated(.[,2:3]) | duplicated(.[,2:3], fromLast = T), ]

Another dplyr approach -

df %>% filter(., ave(seq_len(nrow(.)), dob, lname, FUN = length) > 1)

Filter to all rows where there are duplicate values in two columns (dplyr)

2 Answers2