Identify duplicates based on duplicated data and some additional conditions, using `dplyr`

Question

I was checking out this answer (Summarize all group values and a conditional subset in the same call) as it gives me the right idea (setting a dupe column), but I need some more help.

So my data is structured something like this (medical data so keeping it as anonymised as possible):

index	name	category	date	checkup_complete
1	name1	cat1	date1	Y
2	name1	cat1	date1	N
3	name1	cat2	date1	Y
4	name1	cat1	date1	Y
5	name1	cat1	date2	N
6	name2	cat1	date1	Y
7	name3	cat1	date2	Y
8	name2	cat2	date1	Y
9	name3	cat1	date2	N
10	name2	cat2	date1	Y
11	name1	cat1	date1	N

Data is considered a duplicate if it fits the following criteria:

For a given category, if there are multiple names with the same name, on the same date with checkup_complete == "Y". The ones matching this criteria will get collapsed to a single record.

So based on this, the duplicates are indices: 1 and 4; 8 and 10. Indices 2 and 11 are not duplicates.

The actual table is 10s of 000s of rows, so is there a way I can use dplyr (as I'm using it as part of the overall data-frame workflow) to mark these are duplicates or not?

Based on the link I provided above I sense that I would use group_by on category, name, date, and mutate a dupe column. Question is, without creating another column (cond_dupe) to then filter based on checkup_complete AND dupe, is there a way of doing it in one go?

@TTS: Can I use `dplyr::distinct` across `name`, `category` and `date`, whilst setting a specific condition for `checkup_complete`? — azymandius, Jun 16 '21 at 14:19

Karthik S · Accepted Answer · 2021-06-16T14:20:51.547

0

Does this work:

df %>% group_by(category, name, date, checkup_complete) %>% filter(if(all(checkup_complete == 'Y')) row_number() == n() else TRUE)
# A tibble: 9 x 5
# Groups:   category, name, date, checkup_complete [8]
  index name  category date  checkup_complete
  <dbl> <chr> <chr>    <chr> <chr>           
1     2 name1 cat1     date1 N               
2     3 name1 cat2     date1 Y               
3     4 name1 cat1     date1 Y               
4     5 name1 cat1     date2 N               
5     6 name2 cat1     date1 Y               
6     7 name3 cat1     date2 Y               
7     9 name3 cat1     date2 N               
8    10 name2 cat2     date1 Y               
9    11 name1 cat1     date1 N

edited Jun 16 '21 at 14:20

answered Jun 16 '21 at 13:56

Karthik S

11,348
2
11
25

Ah whoops, let me correct that: indices 8 and 10 should have had the same date, sorry. – azymandius Jun 16 '21 at 13:58
@azymandius, the code should work, can you please try and let me know – Karthik S Jun 16 '21 at 13:58
What is the purpose of `filter(row_number() == n())`? – azymandius Jun 16 '21 at 14:01
@azymandius, it selects the last row in the group, if you give `1` inplace of `n()` it will select first row. – Karthik S Jun 16 '21 at 14:03
@azymandius, does my code answer your question? – Karthik S Jun 16 '21 at 14:07
Clever, I like it. I realise I forgot another variation in my data to take into account the actual condition I'm having trouble with. Have updated the table with an 11th index. I don't want indices 2 and 11 to be collapsed to 1 item since as long as `checkup_complete` == *"N"*, they don't count as duplicates even if they have the same `name`, `category` and `date`. – azymandius Jun 16 '21 at 14:08
@azymandius, have updated my answer, please check if that works – Karthik S Jun 16 '21 at 14:21
Thanks heaps, it worked perfectly. I was just testing it with a manual version (with a number of transformations with `distinct`) and I get the same results so this is much easier to maintain. Could you explain what `filter(if(all(checkup_complete == 'Y')) row_number() == n() else TRUE)` does? I understand that it checks for all rows where `checkup_complete` == *'Y'*. What does `else TRUE` achieve? – azymandius Jun 16 '21 at 14:48
@azymandius, `else TRUE` is catch all, i.e. all other cases are filtered in. – Karthik S Jun 16 '21 at 14:51

Identify duplicates based on duplicated data and some additional conditions, using `dplyr`

1 Answers1