Group by id and drug (with dates <100 days of each other) take the earliest and latest date

Question

Here is my dataset:

mydata = data.frame (Id =c(1,1,1,1,1,1,1,1,1,1),
                     Date = c("2000-01-01","2000-01-05","2000-02-02", "2000-02-12", 
                             "2000-02-14","2000-05-13", "2000-05-15", "2000-05-17", 
                              "2000-05-16", "2000-05-20"),
                     drug = c("A","A","B","B","B","A","A","A","C","C"))

The below code tells me the difference between dates of administration grouped by ID and drug. As you can see for drug A there is a gap >100 days between dates administered.

mydata <- mydata %>% group_by(Id, drug) %>% mutate(Diff = difftime(Date, lag(Date), units = 'days'))

The task is to group by id and drug and take the earliest and latest date of administration for each drug, but if there is a >100 day gap between the dates between the same type of drug then it needs to have it own earliest and latest date row.

The following code allows me to get the earliest and latest date, but I'm not sure how to add the 100 day gap here.

mydata %>% group_by(Id, drug) %>% 
  summarise(startDate = min(as.Date(Date),na.rm = T),
            endDate = max(as.Date(Date),na.rm = T))

The below is the output of what im hoping to get

mydata1 = data.frame (Id =c(1,1,1,1),
                     startDate = c("2000-01-01","2000-02-02","2000-05-13", "2000-05-16"),
                     endDate = c("2000-01-05", "2000-02-14", "2000-05-17", "2000-05-20"),
                     drug = c("A","B","A","C"))

As you can see for drug A there is two rows representing the first start and end date and then the second start and end date after >100 days passed between administration dates.

Any help will be really appreciated! Thank you

score 2 · Accepted Answer · answered Apr 27 '22 at 22:47

You could create a new grouping by using cumsum:

library(dplyr)

mydata %>% 
  group_by(Id, drug) %>% 
  mutate(Diff = difftime(Date, lag(Date), units = 'days')) %>%  
  group_by(Id, drug, grp = cumsum(coalesce(Diff, as.difftime(0, units = 'days')) > 100)) %>% 
  summarise(startDate = min(as.Date(Date),na.rm = T),
            endDate = max(as.Date(Date),na.rm = T),
            .groups = "drop") %>% 
  select(-grp)

This returns

# A tibble: 4 x 4
     Id drug  startDate  endDate   
  <dbl> <chr> <date>     <date>    
1     1 A     2000-01-01 2000-01-05
2     1 A     2000-05-13 2000-05-17
3     1 B     2000-02-02 2000-02-14
4     1 C     2000-05-16 2000-05-20

if possible, would you be able to explain whats happening here: group_by(Id, drug, grp = cumsum(coalesce(Diff, as.difftime(0, units = 'days')) > 100)). What does the 0 after the difftime do? Thank you! — T K, Jun 16 '22 at 18:48

Group by id and drug (with dates <100 days of each other) take the earliest and latest date

1 Answers1

Linked