0

I was checking out this answer (Summarize all group values and a conditional subset in the same call) as it gives me the right idea (setting a dupe column), but I need some more help.

So my data is structured something like this (medical data so keeping it as anonymised as possible):

index name category date checkup_complete
1 name1 cat1 date1 Y
2 name1 cat1 date1 N
3 name1 cat2 date1 Y
4 name1 cat1 date1 Y
5 name1 cat1 date2 N
6 name2 cat1 date1 Y
7 name3 cat1 date2 Y
8 name2 cat2 date1 Y
9 name3 cat1 date2 N
10 name2 cat2 date1 Y
11 name1 cat1 date1 N

Data is considered a duplicate if it fits the following criteria:

For a given category, if there are multiple names with the same name, on the same date with checkup_complete == "Y". The ones matching this criteria will get collapsed to a single record.

So based on this, the duplicates are indices: 1 and 4; 8 and 10. Indices 2 and 11 are not duplicates.

The actual table is 10s of 000s of rows, so is there a way I can use dplyr (as I'm using it as part of the overall data-frame workflow) to mark these are duplicates or not?

Based on the link I provided above I sense that I would use group_by on category, name, date, and mutate a dupe column. Question is, without creating another column (cond_dupe) to then filter based on checkup_complete AND dupe, is there a way of doing it in one go?

azymandius
  • 13
  • 3

1 Answers1

0

Does this work:

df %>% group_by(category, name, date, checkup_complete) %>% filter(if(all(checkup_complete == 'Y')) row_number() == n() else TRUE)
# A tibble: 9 x 5
# Groups:   category, name, date, checkup_complete [8]
  index name  category date  checkup_complete
  <dbl> <chr> <chr>    <chr> <chr>           
1     2 name1 cat1     date1 N               
2     3 name1 cat2     date1 Y               
3     4 name1 cat1     date1 Y               
4     5 name1 cat1     date2 N               
5     6 name2 cat1     date1 Y               
6     7 name3 cat1     date2 Y               
7     9 name3 cat1     date2 N               
8    10 name2 cat2     date1 Y               
9    11 name1 cat1     date1 N       
Karthik S
  • 11,348
  • 2
  • 11
  • 25
  • Ah whoops, let me correct that: indices 8 and 10 should have had the same date, sorry. – azymandius Jun 16 '21 at 13:58
  • @azymandius, the code should work, can you please try and let me know – Karthik S Jun 16 '21 at 13:58
  • What is the purpose of `filter(row_number() == n())`? – azymandius Jun 16 '21 at 14:01
  • @azymandius, it selects the last row in the group, if you give `1` inplace of `n()` it will select first row. – Karthik S Jun 16 '21 at 14:03
  • @azymandius, does my code answer your question? – Karthik S Jun 16 '21 at 14:07
  • Clever, I like it. I realise I forgot another variation in my data to take into account the actual condition I'm having trouble with. Have updated the table with an 11th index. I don't want indices 2 and 11 to be collapsed to 1 item since as long as `checkup_complete` == *"N"*, they don't count as duplicates even if they have the same `name`, `category` and `date`. – azymandius Jun 16 '21 at 14:08
  • @azymandius, have updated my answer, please check if that works – Karthik S Jun 16 '21 at 14:21
  • Thanks heaps, it worked perfectly. I was just testing it with a manual version (with a number of transformations with `distinct`) and I get the same results so this is much easier to maintain. Could you explain what `filter(if(all(checkup_complete == 'Y')) row_number() == n() else TRUE)` does? I understand that it checks for all rows where `checkup_complete` == *'Y'*. What does `else TRUE` achieve? – azymandius Jun 16 '21 at 14:48
  • @azymandius, `else TRUE` is catch all, i.e. all other cases are filtered in. – Karthik S Jun 16 '21 at 14:51