Calculating overlapping dates in R (dplyr)

Question

I have a dataset of individuals (CSN), each of whom has had anywhere from zero to multiple interventions performed during a hospital admission (in this case, central lines placed), each with a start and an end date. I am trying to generate a new date range that will calculate any overlapping dates. In other words, I'm trying to calculate the total date range when an individual had a central line in place.

Data for example:

structure(list(CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), First_day = structure(c(1514937600, 
1514937600, 1515024000, 1515024000, 1515110400, 1515974400, 1516147200, 
1516147200, 1516147200, 1516233600, 1516233600, 1517097600, 1517097600, 
1517702400, 1517356800, 1518220800, 1519257600, 1519948800, 1520812800, 
1521504000, 1522022400), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), Last_day = structure(c(1515628800, 1515110400, 1515542400, 
1515542400, 1515628800, 1516579200, 1516320000, 1517184000, 1516233600, 
1517184000, 1517702400, 1517184000, 1517616000, 1517702400, 1518220800, 
1518825600, 1519689600, 1520812800, 1521763200, 1522108800, 1522108800
), tzone = "UTC", class = c("POSIXct", "POSIXt"))), row.names = c(NA, 
-21L), class = c("tbl_df", "tbl", "data.frame"))

Ideally, the output would return a single date range for all overlapping dates, but if there were a stretch of days that are missed by each, then a new interval would be created. So, for group 1, rows 1-5 would all have start = 2018-01-03 and end = 2018-01-11, but then row 6 would have start = 2018-01-15 and end = 2018-01-22.

I've tried to do the following:

df %>% 
  arrange(CSN_id, First_day) %>% 
  mutate(First_day = ymd(First_day),
         Last_day = ymd(Last_day),
         start = ymd("1970-01-01"),
         end = ymd("1970-01-01")) %>% 
  group_by(CSN_id) %>% 
  rowwise() %>% 
  mutate(test = if_else(row_number() == 1, interval(First_day, Last_day), interval(lag(start), lag(end))),
         start = if_else(row_number() == 1, First_day,
                         if_else(First_day <= lag(end), lag(First_day), First_day)),
         end = if_else(row_number() == 1, Last_day,
                       if_else(Last_day %within% lag(test) == TRUE, lag(end), Last_day)))

However, I don't think the lag function is working as intended, and it always returns Last_day for some reason. I tried getting rid of rowwise, but then the intervals get messed up (persistently stuck in 1970s).

The output I'm getting is:

structure(list(CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), First_day = structure(c(17534, 
17534, 17535, 17535, 17536, 17546, 17548, 17548, 17548, 17549, 
17549, 17559, 17559, 17566, 17562, 17572, 17584, 17592, 17602, 
17610, 17616), class = "Date"), Last_day = structure(c(17542, 
17536, 17541, 17541, 17542, 17553, 17550, 17560, 17549, 17560, 
17566, 17560, 17565, 17566, 17572, 17579, 17589, 17602, 17613, 
17617, 17617), class = "Date"), start = structure(c(17534, 17534, 
17535, 17535, 17536, 17546, 17548, 17548, 17548, 17549, 17549, 
17559, 17559, 17566, 17562, 17572, 17584, 17592, 17602, 17610, 
17616), class = "Date"), end = structure(c(17542, 17536, 17541, 
17541, 17542, 17553, 17550, 17560, 17549, 17560, 17566, 17560, 
17565, 17566, 17572, 17579, 17589, 17602, 17613, 17617, 17617
), class = "Date"), test = new("Interval", .Data = c(691200, 
172800, 518400, 518400, 518400, 604800, 172800, 1036800, 86400, 
950400, 1468800, 86400, 518400, 0, 864000, 604800, 432000, 864000, 
950400, 604800, 86400), start = structure(c(1514937600, 1514937600, 
1515024000, 1515024000, 1515110400, 1515974400, 1516147200, 1516147200, 
1516147200, 1516233600, 1516233600, 1517097600, 1517097600, 1517702400, 
1517356800, 1518220800, 1519257600, 1519948800, 1520812800, 1521504000, 
1522022400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    tzone = "UTC")), class = c("rowwise_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -21L), groups = structure(list(
    CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .rows = structure(list(
        1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
        14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -21L), class = c("tbl_df", 
"tbl", "data.frame")))

Is there something obvious I'm missing? Any help would be much appreciated!

score 2 · Answer 1 · answered Mar 28 '22 at 03:48

I'm not exactly sure what your desired output is, but you can try this approach:

dat %>% 
  arrange(CSN_id,First_day,Last_day) %>% 
  group_by(CSN_id, First_day) %>%
  summarize(Last_day=max(Last_day,na.rm=T)) %>% 
  mutate(interval=as.numeric(First_day- lag(Last_day))>0,
         interval=cumsum(if_else(is.na(interval),FALSE,interval))+1) %>% 
  group_by(CSN_id,interval) %>% 
  summarize(start = min(First_day),
         end = max(Last_day))

Output:

  CSN_id interval start               end                
   <int>    <dbl> <dttm>              <dttm>             
1      1        1 2018-01-03 00:00:00 2018-01-11 00:00:00
2      1        2 2018-01-15 00:00:00 2018-01-22 00:00:00
3      2        1 2018-01-17 00:00:00 2018-01-19 00:00:00
4      3        1 2018-01-17 00:00:00 2018-02-04 00:00:00
5      3        2 2018-02-04 00:00:00 2018-02-04 00:00:00
6      4        1 2018-01-31 00:00:00 2018-02-17 00:00:00
7      4        2 2018-02-22 00:00:00 2018-02-27 00:00:00
8      4        3 2018-03-02 00:00:00 2018-03-27 00:00:00

If you prefer to retain all the original rows, and all the dates are dates and not datetimes, you could also do something like this:

dat %>% 
  mutate(across(First_day:Last_day, ~as.Date(.x))) %>% 
  arrange(CSN_id,First_day,Last_day) %>% 
  group_by(CSN_id) %>%
  mutate(interval=as.numeric(First_day- lag(Last_day))>0,
         interval=cumsum(if_else(is.na(interval),FALSE,interval))+1) %>% 
  group_by(CSN_id,interval) %>% 
  mutate(start = min(First_day),
            end = max(Last_day))

Output:

   CSN_id First_day  Last_day   interval start      end       
    <int> <date>     <date>        <dbl> <date>     <date>    
 1      1 2018-01-03 2018-01-05        1 2018-01-03 2018-01-11
 2      1 2018-01-03 2018-01-11        1 2018-01-03 2018-01-11
 3      1 2018-01-04 2018-01-10        1 2018-01-03 2018-01-11
 4      1 2018-01-04 2018-01-10        1 2018-01-03 2018-01-11
 5      1 2018-01-05 2018-01-11        1 2018-01-03 2018-01-11
 6      1 2018-01-15 2018-01-22        2 2018-01-15 2018-01-22
 7      2 2018-01-17 2018-01-19        1 2018-01-17 2018-01-19
 8      3 2018-01-17 2018-01-18        1 2018-01-17 2018-02-04
 9      3 2018-01-17 2018-01-29        1 2018-01-17 2018-02-04
10      3 2018-01-18 2018-01-29        1 2018-01-17 2018-02-04
# ... with 11 more rows

Awesome! Yeah this would definitely work. I have to parse through it to see how you did it but I can definitely work with this. Ultimately I want to preserve the number of rows and add a new variable that will be the max # days between any start and end within that CSN_id but with this table I could easily pull it back in, I think. Thank you so much for the quick response! — djd02007, Mar 28 '22 at 03:56

AndrewGB · Accepted Answer · 2022-03-28T04:50:02.703

Here is another option using the IRanges package on Bioconductor. The collapse_date_ranges function is taken from here, and I just adjusted according

library(data.table)
library(tidyverse)

collapse_date_ranges <- function(w, min.gapwidth = 1L) {
  IRanges::IRanges(start = as.integer(as.Date(w$First_day)), 
                   end = as.integer(as.Date(w$Last_day))) %>% 
    IRanges::reduce(min.gapwidth = min.gapwidth) %>% 
    as.data.table() %>% 
    .[, lapply(.SD, lubridate::as_date),
      .SDcols = c("start", "end")]
}


split(df, df$CSN_id) %>% 
  map(., ~collapse_date_ranges(., 0L)) %>% 
  bind_rows(., .id = 'id')

Output

   id      start        end
1:  1 2018-01-03 2018-01-11
2:  1 2018-01-15 2018-01-22
3:  2 2018-01-17 2018-01-19
4:  3 2018-01-17 2018-02-04
5:  4 2018-01-31 2018-02-17
6:  4 2018-02-22 2018-02-27
7:  4 2018-03-02 2018-03-27

If you want to have this in the original dataframe, then we can join the data back to the original dataframe, then use fill to add the dates to each row.

split(df, df$CSN_id) %>% 
  map(., ~collapse_date_ranges(., 0L)) %>% 
  bind_rows(., .id = 'CSN_id2') %>% 
  data.frame %>% 
  mutate(CSN_id2 = as.integer(CSN_id2)) %>% 
  full_join(df, ., by = c("CSN_id" = "CSN_id2", "First_day" = "start"), keep = TRUE) %>% 
  select(-CSN_id2) %>% 
  group_by(CSN_id) %>% 
  fill(start, end, .direction = "down")

Output

   CSN_id First_day           Last_day            start      end       
    <int> <dttm>              <dttm>              <date>     <date>    
 1      1 2018-01-03 00:00:00 2018-01-11 00:00:00 2018-01-03 2018-01-11
 2      1 2018-01-03 00:00:00 2018-01-05 00:00:00 2018-01-03 2018-01-11
 3      1 2018-01-04 00:00:00 2018-01-10 00:00:00 2018-01-03 2018-01-11
 4      1 2018-01-04 00:00:00 2018-01-10 00:00:00 2018-01-03 2018-01-11
 5      1 2018-01-05 00:00:00 2018-01-11 00:00:00 2018-01-03 2018-01-11
 6      1 2018-01-15 00:00:00 2018-01-22 00:00:00 2018-01-15 2018-01-22
 7      2 2018-01-17 00:00:00 2018-01-19 00:00:00 2018-01-17 2018-01-19
 8      3 2018-01-17 00:00:00 2018-01-29 00:00:00 2018-01-17 2018-02-04
 9      3 2018-01-17 00:00:00 2018-01-18 00:00:00 2018-01-17 2018-02-04
10      3 2018-01-18 00:00:00 2018-01-29 00:00:00 2018-01-17 2018-02-04
# … with 11 more rows

Thanks! I saw some of the questions referencing IRanges and I spent some time trying it but had trouble mapping it to my question. This seems to work quite well. As I was applying it to the larger dataset, I got NA's in some rows but I think I solved it by arranging the dataset first by CSN_id and First_day. I'm not sure if that makes sense but I'll try it on a few more snippets of the data to verify. Thanks again! — djd02007, Mar 28 '22 at 08:46
@djd02007 Yeah, I think it is dependent upon the data being in order. But you will get NAs when doing the `left_join` as I just join it on the start date, then just `fill` in the rest of the values for `start` and `end`. I'd imagine that it is probably quicker than `tidyverse`, but obviously it's a little more complicated to follow. There's always a tradeoff somewhere! You're welcome! — AndrewGB, Mar 28 '22 at 16:22

Calculating overlapping dates in R (dplyr)

2 Answers2