1

I have a dataset with what are essentially episodes of time related to an individual which can overlap (i.e. an episode could start later but finish earlier than the previous). Because of this overlap issue I'm struggling to get the latest end_date in the sequence once they're in order by start_date.

The code I've been using works to a point but I have to repeat as shown in the code below. For that reason I guess I need some loop function to go through a process until a condition is met (that the end_date is later than the end_date on the previous row, or the id indicates a new individual).

library(dplyr)

## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
        "A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
                        "2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
                        "2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
                        "2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
                      "2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
                      "2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
                      "2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
                         "2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
                         "2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
                         "2012-09-28","2013-10-11"))

df <- data.frame(id, start_date, end_date, target_date)

Using the method to flatten overlapping time periods gets me close but I think it needs a lag adding in somewhere to replicate the target_date...

df <- df %>%
    arrange(id, start_date) %>%
    group_by(id) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
                                    cummax(as.numeric(end_date)))[-n()])) %>%
    group_by(id, indx) %>%       
    mutate(latest_date = max(end_date)) %>%
    ungroup()
TimL
  • 211
  • 2
  • 11
  • IMHO, this is a variation of [David Arenburg's answer](https://stackoverflow.com/a/28938694/3817004) – Uwe Sep 09 '19 at 10:27
  • Using that answer on this data does not merge row 14 to 16 correctly in one episode. `2003-12-01 to 2012-04-16` instead of `2003-12-01 to 2012-08-29` as you would expect. – Wietze314 Sep 09 '19 at 10:50
  • @Uwe point taken and thanks for the link but as Wietze314 says it doesn't quite the desired result. I've edited the original question to reflect that – TimL Sep 09 '19 at 11:19
  • Perhaps, `df %>% group_by(id) %>% mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(start_date))) %>% group_by(id, grp) %>% mutate(target_date_new = max(end_date)) %>% group_by(id) %>% mutate(target_date_new = lag(target_date_new))` – Uwe Sep 09 '19 at 11:30
  • @TimL Please, can you double check your expected result. IMHO, rows 17 to 19 form another overlapping period 2012-08-30 to 2013-10-11 but your expected result shows three single episodes. Thank you. – Uwe Sep 09 '19 at 11:41
  • I do have a different solution, however I am unable to post it due to duplicate mark. – Wietze314 Sep 09 '19 at 11:58
  • @Wietze314, I have voted for re-open. – Uwe Sep 09 '19 at 14:20
  • @Uwe apologies, your answer gives the desired result I am after. I think I rushed my example data and after trimming it down a little that anomaly occurred... – TimL Sep 11 '19 at 09:40

2 Answers2

1

I would give this problem a different approach than using lag. The issue is that there is a hierarchical structure in your data that can have multiple levels.

In the following code I try to look for the other episodes of which the current row is a part of (i.e. lies completely within another episode). Then I take the min(start_date) and max(end_date) to define the outer most episode.


library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(id, start_date, end_date, target_date) %>%
  mutate(episode = row_number())

df %>%
  select(id, episode,start_date, end_date) %>%
  inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
  group_by(id,episode,start_date, end_date) %>%
  nest() %>%
  mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
                                                                end_date_outer >= ..3))) %>%
  mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
         end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))


this results in


# A tibble: 19 x 8
   id    episode start_date end_date   data              match            start_date_parent end_date_parent
   <fct>   <int> <date>     <date>     <list>            <list>           <date>            <date>         
 1 A           1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23        2009-06-30     
 2 A           2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 3 A           3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 4 A           4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 5 A           5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 6 A           6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 7 A           7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 8 A           8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12        2010-11-05     
 9 A           9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08        2011-11-18     
10 A          10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26        2012-06-26     
11 A          11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26        2012-06-26     
12 A          12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11        2012-09-11     
13 A          13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03        2014-04-01     
14 B          14 2003-12-01 2012-08-29 <tibble [6 x 3]>  <tibble [1 x 3]> 2003-12-01        2012-08-29     
15 B          15 2006-02-28 2006-02-28 <tibble [6 x 3]>  <tibble [2 x 3]> 2003-12-01        2012-08-29     
16 B          16 2012-04-16 2012-04-16 <tibble [6 x 3]>  <tibble [2 x 3]> 2003-12-01        2012-08-29     
17 B          17 2012-08-30 2012-09-28 <tibble [6 x 3]>  <tibble [1 x 3]> 2012-08-30        2012-09-28     
18 B          18 2012-09-19 2013-10-11 <tibble [6 x 3]>  <tibble [1 x 3]> 2012-09-19        2013-10-11     
19 B          19 2012-09-28 2013-07-19 <tibble [6 x 3]>  <tibble [2 x 3]> 2012-09-19        2013-10-11  

We can see here that the first 7 episodes of id A are part of episode 1 and the rest stand on their own.


Another option would be to use sqldf for example if the dataset becomes large.


require(sqldf)

result <- sqldf("select
      df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
      from df AS df1

      inner join df AS df2 
      on df1.id = df2.id
      and df1.start_date >= df2.start_date
      and df1.end_date <= df2.end_date

      group by df1.id, df1.episode
      ")

result %>%
  select(id, start_date, end_date) %>%
  distinct()

results in:


  id start_date   end_date
1  A 2004-01-23 2009-06-30
2  A 2010-10-12 2010-11-05
3  A 2011-08-08 2011-11-18
4  A 2012-06-26 2012-06-26
5  A 2012-09-11 2012-09-11
6  A 2012-10-03 2014-04-01
7  B 2003-12-01 2012-08-29
8  B 2012-08-30 2012-09-28
9  B 2012-09-19 2013-10-11

Wietze314
  • 5,942
  • 2
  • 21
  • 40
1

If I understand correctly, the OP wants to identify overlapping episodes which are fully embraced by the longer episode. In addition, the end date of the embracing period should appear on the next row (within an id)

This can be accomplished by a variation of David Arenburg's approach:

df %>% 
  arrange(id, start_date) %>% # df must be ordered appropriately
  group_by(id) %>% # create new grouping variable
  mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>% 
  group_by(id, grp) %>% 
  mutate(target_date_new = max(end_date)) %>% 
  group_by(id) %>% # re-group ...
  mutate(target_date_new = lag(target_date_new)) # ... for lagging
# A tibble: 19 x 6
# Groups:   id [2]
   id    start_date end_date   target_date   grp target_date_new
   <fct> <date>     <date>     <date>      <int> <date>         
 1 A     2004-01-23 2009-06-30 NA              1 NA             
 2 A     2005-03-31 2005-09-17 2009-06-30      1 2009-06-30     
 3 A     2005-03-31 2005-09-19 2009-06-30      1 2009-06-30     
 4 A     2005-12-20 2005-12-30 2009-06-30      1 2009-06-30     
 5 A     2005-12-20 2005-12-30 2009-06-30      1 2009-06-30     
 6 A     2006-04-03 2006-06-19 2009-06-30      1 2009-06-30     
 7 A     2007-11-26 2009-06-30 2009-06-30      1 2009-06-30     
 8 A     2010-10-12 2010-11-05 2009-06-30      2 2009-06-30     
 9 A     2011-08-08 2011-11-18 2010-11-05      3 2010-11-05     
10 A     2012-06-26 2012-06-26 2011-11-18      4 2011-11-18     
11 A     2012-06-26 2012-06-26 2012-06-26      4 2012-06-26     
12 A     2012-09-11 2012-09-11 2012-06-26      5 2012-06-26     
13 A     2012-10-03 2014-04-01 2012-09-11      6 2012-09-11     
14 B     2003-12-01 2012-08-29 NA              1 NA             
15 B     2006-02-28 2006-02-28 2012-08-29      1 2012-08-29     
16 B     2012-04-16 2012-04-16 2012-08-29      1 2012-08-29     
17 B     2012-08-30 2012-09-28 2012-08-29      2 2012-08-29     
18 B     2012-09-19 2013-10-11 2012-09-28      3 2012-09-28     
19 B     2012-09-28 2013-07-19 2013-10-11      3 2013-10-11

Here, end_dates are compared because the OP wants to detect fully embraced periods. So, whenever an end_date appears which is larger than any of the previous end_dates the episode counter grp is advanced because the current episode is not fully included in the previous periods.

As cummax() has no method for objects of type Date, the dates are coerced to integer value.

Uwe
  • 41,420
  • 11
  • 90
  • 134