I was checking out this answer (Summarize all group values and a conditional subset in the same call) as it gives me the right idea (setting a `dupe` column), but I need some more help.
So my data is structured something like this (medical data so keeping it as anonymised as possible):
| index | name | category | date | checkup_complete |
|---|---|---|---|---|
| 1 | name1 | cat1 | date1 | Y |
| 2 | name1 | cat1 | date1 | N |
| 3 | name1 | cat2 | date1 | Y |
| 4 | name1 | cat1 | date1 | Y |
| 5 | name1 | cat1 | date2 | N |
| 6 | name2 | cat1 | date1 | Y |
| 7 | name3 | cat1 | date2 | Y |
| 8 | name2 | cat2 | date1 | Y |
| 9 | name3 | cat1 | date2 | N |
| 10 | name2 | cat2 | date1 | Y |
| 11 | name1 | cat1 | date1 | N |
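For reproducibility, here's the same data as a tibble (the values are the anonymised placeholders from the table above):

```r
library(tibble)

# Reproducible version of the table above (anonymised placeholder values)
df <- tribble(
  ~index, ~name,   ~category, ~date,   ~checkup_complete,
  1,      "name1", "cat1",    "date1", "Y",
  2,      "name1", "cat1",    "date1", "N",
  3,      "name1", "cat2",    "date1", "Y",
  4,      "name1", "cat1",    "date1", "Y",
  5,      "name1", "cat1",    "date2", "N",
  6,      "name2", "cat1",    "date1", "Y",
  7,      "name3", "cat1",    "date2", "Y",
  8,      "name2", "cat2",    "date1", "Y",
  9,      "name3", "cat1",    "date2", "N",
  10,     "name2", "cat2",    "date1", "Y",
  11,     "name1", "cat1",    "date1", "N"
)
```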
Data is considered a duplicate if it fits the following criteria: for a given `category`, there are multiple rows with the same `name`, on the same `date`, with `checkup_complete == "Y"`. The rows matching these criteria will get collapsed to a single record.
So based on this, the duplicates are indices: 1 and 4; 8 and 10. Indices 2 and 11 are not duplicates.
The actual table is tens of thousands of rows, so is there a way I can use dplyr (as I'm using it as part of the overall data-frame workflow) to mark whether these are duplicates or not?
Based on the link I provided above, I sense that I would use `group_by()` on `category`, `name`, and `date`, and mutate a `dupe` column. Question is: without creating another column (`cond_dupe`) to then filter based on `checkup_complete` AND `dupe`, is there a way of doing it in one go?