
I have been trying for a while now, with no success, to solve a problem similar to the one presented in this issue. It consists of flagging items that are duplicated within a group, but also flagging the original item used for the comparison, using dplyr (I prefer dplyr over base or data.table).

The solution I tried is as follows:

> library(dplyr)
> a <- data.frame(name=c("a","b","b","b","a","a"),position=c(1,2,1,2,2,2),achieved=c(1,0,0,0,1,0))
> a %>% group_by(name,achieved) %>% mutate(duplicated=duplicated(position))
# A tibble: 6 x 4
# Groups:   name, achieved [3]
  name  position achieved duplicated
  <fct>    <dbl>    <dbl> <lgl>     
1 a            1        1 FALSE     
2 b            2        0 FALSE     
3 b            1        0 FALSE     
4 b            2        0 TRUE      
5 a            2        1 FALSE     
6 a            2        0 FALSE

I know that this solution is close to the one I want, but it only flags the values that are duplicated after the first occurrence. I would like a dplyr solution that flags all duplicated values per group, so this could probably also help me improve my dplyr understanding.

The desired output would be as follows:

# A tibble: 6 x 4
# Groups:   name, achieved [3]
  name  position achieved duplicated
  <fct>    <dbl>    <dbl> <lgl>     
1 a            1        1 FALSE     
2 b            2        0 TRUE      
3 b            1        0 FALSE     
4 b            2        0 TRUE      
5 a            2        1 FALSE     
6 a            2        0 FALSE

Thanks in advance.

Just Burfi

2 Answers


It seems like you want to group by all of name, position, and achieved, and then just check whether there is more than one record in that group:

a %>% group_by(name, achieved, position) %>% mutate(duplicated = n() > 1)

#   name  position achieved duplicated
#  <fct>    <dbl>    <dbl> <lgl>     
# 1 a            1        1 FALSE     
# 2 b            2        0 TRUE      
# 3 b            1        0 FALSE     
# 4 b            2        0 TRUE      
# 5 a            2        1 FALSE     
# 6 a            2        0 FALSE  
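
If you prefer to stay closer to the `duplicated()` call from the question, a sketch like the following (an equivalent idea, not part of the original answer) keeps the grouping by `name` and `achieved` and also checks from the last element, so the first occurrence gets flagged too:

a %>%
  group_by(name, achieved) %>%
  # a position counts as duplicated if it repeats looking from either end of the group
  mutate(duplicated = duplicated(position) | duplicated(position, fromLast = TRUE))

This should produce the same duplicated column as grouping by all three variables.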
MrFlick
  • Actually this has helped me to solve my problem! Thank you very much, this is applicable to my actual problem – Just Burfi Feb 20 '19 at 18:56

Try this:

a %>%
  group_by_all() %>%
  mutate(duplicated = n() > 1)
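
As a usage note (an addition, not from the original answer): if you only want to keep the duplicated rows rather than flag them, the same grouping combines naturally with `filter()`:

a %>%
  group_by_all() %>%
  filter(n() > 1)   # keeps only rows whose full combination of values appears more than once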
arg0naut91
  • Actually this has helped me to solve my problem! Thank you very much – Just Burfi Feb 20 '19 at 18:56
  • You're welcome! Note that this will flag duplicates by all columns in the dataset; thus it'll only work if `name`, `position` and `achieved` are the only columns in your data frame. Otherwise see the other solution in this topic, the logic is the same – arg0naut91 Feb 20 '19 at 18:58
  • In my real case (more columns) there would be a slight difference: as you say, there would be some extra columns that might not be as useful for flagging these duplicates. Is there any systematic way to ignore those columns without erasing them? Thanks again for your useful observations. – Just Burfi Feb 20 '19 at 19:12
  • The other solution here should work. Apart from that I don't see much better dplyr options. – arg0naut91 Feb 20 '19 at 19:13
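
Following up on the last two comments, here is a minimal sketch of ignoring extra columns when flagging duplicates; `notes` is a hypothetical extra column standing in for the ones that should not take part in the grouping:

# 'notes' is a made-up extra column that should be ignored for the duplicate check
a$notes <- letters[1:6]

a %>%
  group_by_at(vars(-notes)) %>%   # group by every column except 'notes'
  mutate(duplicated = n() > 1) %>%
  ungroup()

In newer dplyr versions the same idea can be expressed with `across()` inside `group_by()`; `group_by_at()` is used here only because it matches the scoped style of `group_by_all()` in the answer above.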