Take the earliest mention of a figure from within a range by ID

Question

Similar to Extract rows for the first occurrence of a variable in a data frame Dataset example:

ID   grade    date 
A     1        1/1/2001
A     3        1/2/2002 
A     5        2/4/2002
B     4        1/1/2008
B     4        1/1/2010
B     5        1/1/2011
B     5        1/3/2011
B     4        1/5/2011
C     2        1/1/2006
C     5        1/1/2007
C     5        1/1/2008
D     3        1/1/1996
D     5        1/1/1997
D     5        1/1/1999
E     1        1/1/2003
E     3        1/1/2005
E     3        1/1/2007

I would like to keep all entries per person until they hit grade 5 and then stop irrespective of grades after the earliest mention of 5. Not all IDs have reached stage 5, they should remain untouched even if they have multiple of the same grade

Desired output:

ID   grade    date 
A     1        1/1/2001
A     3        1/2/2002 
A     5        2/4/2002
B     4        1/1/2008
B     4        1/1/2010
B     5        1/1/2011
C     2        1/1/2006
C     5        1/1/2007
D     3        1/1/1996
D     5        1/1/1997
E     1        1/1/2003
E     3        1/1/2005
E     3        1/1/2007

I can used order and in dplyr the filter function but I am unsure how to apply these to just the 5's per person and then exclude anything afterwards?

Many thanks

Sotos · Accepted Answer · 2022-12-05T15:03:21.993

2

You can use slice to grab the sequence from 1 until the first appearance of 5, i.e.

library(dplyr)

df %>% 
 group_by(ID) %>% 
 slice(1L : first(which(grade == 5)[1])) 

# A tibble: 10 × 3
# Groups:   ID [4]
   ID    grade date    
   <chr> <int> <chr>   
 1 A         1 1/1/2001
 2 A         3 1/2/2002
 3 A         5 2/4/2002
 4 B         4 1/1/2008
 5 B         4 1/1/2010
 6 B         5 1/1/2011
 7 C         2 1/1/2006
 8 C         5 1/1/2007
 9 D         3 1/1/1996
10 D         5 1/1/1997

EDIT If you have groups that do NOT have grade = 5, then

df %>% 
 group_by(ID) %>% 
 mutate(new = which(grade == 5)[1], 
        new = replace(new, is.na(new), max(row_number()))) %>% 
 slice(1L : first(new)) %>%
 select(-new)

# A tibble: 13 × 4
# Groups:   ID [5]
   ID    grade date       
   <chr> <int> <chr>    
 1 A         1 1/1/2001     
 2 A         3 1/2/2002     
 3 A         5 2/4/2002     
 4 B         4 1/1/2008     
 5 B         4 1/1/2010     
 6 B         5 1/1/2011     
 7 C         2 1/1/2006     
 8 C         5 1/1/2007     
 9 D         3 1/1/1996     
10 D         5 1/1/1997     
11 E         1 1/1/2003     
12 E         3 1/1/2005     
13 E         3 1/1/2007

edited Dec 05 '22 at 15:03

answered Dec 05 '22 at 14:38

Sotos

51,121
6
32
66

Thanks for taking the time to answer. When I do this I get "Error in 1L: first(which(grade== 5)[1]) : NA/NaN argument" – tacrolimus Dec 05 '22 at 14:48
Do you have IDs with all NAs or without 5? – Sotos Dec 05 '22 at 14:50
Not that I am aware of but will double check. Not all IDs have a grade 5, some stop earlier if that is useful to know? – tacrolimus Dec 05 '22 at 14:51
Yes that was it. It couldnt find 5. Let me revise – Sotos Dec 05 '22 at 15:00
Apologies! I can confirm there are no NAs – tacrolimus Dec 05 '22 at 15:01
I edited my answer. Try and let me know – Sotos Dec 05 '22 at 15:03

Take the earliest mention of a figure from within a range by ID

1 Answers1