4

I have a dataset in long format with multiple start and end dates for patients with unique id. Dates represent hospital admission and discharge. Some patients have multiple overlapping stays in hospital, some have stays that don't overlap, and other cases have start (admission) and end dates (discharge) on same day.

Building on a post that used a lagged start date and the cummax function, I wish to do 3 things:

  1. For cases with overlapping start and end dates, combine/merge cases, keeping the earliest start date and last end date.
  2. For cases with start and end dates that are the same date, maintain that observation (don't merge).
  3. Create new variable surgdays that is calculated from max days in surgical unit (surg), for both merged and non-merged cases.

I have data like this:

   id    start        end         surg
1  A  2013-01-01    2013-01-05    0
2  A  2013-01-08    2013-01-12    1
3  A  2013-01-10    2013-01-14    6
4  A  2013-01-20    2013-01-20    3
5  B  2013-01-15    2013-01-25    4 
6  B  2013-01-20    2013-01-22    5 
7  B  2013-01-28    2013-01-28    0

What I've tried:

library(dplyr)

    data %>%
      arrange(data, id, start) %>% 
      group_by(id) %>%
      mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                                  cummax(as.numeric(end)))[-n()])) %>%
      group_by(id, indx) %>%
      summarise(start = first(start), end = last(end), surgdays = max(surg))

What I get:

       id  indx      start        end       surgdays
    1  A    0    2013-01-01    2013-01-05    0
    2  A    1    2013-01-08    2013-01-14    7
    3  A    2    2013-01-20    2013-01-20    3

The problem: the number of rows examined with this code is limited to the number of columns in my dataset. For example, with 4 variables/columns, it worked with data from only first 4 rows (including merging two rows with overlapping dates) then stopped...even though there are 7 rows in example (and thousands of rows in actual dataset).

Similarly, when I try same code with 70 columns (and hundreds of rows), it combines overlapping dates but only based on the first 70 rows. My initial thought was to create as many placeholder columns as there are observations in my dataset but this is clunky workaround.

What I'd like to get:

       id  indx     start        end       surgdays
    1  A    0    2013-01-01    2013-01-05    0
    2  A    1    2013-01-08    2013-01-14    7
    3  A    2    2013-01-20    2013-01-20    3
    4  B    0    2013-01-15    2013-01-22    9
    5  B    1    2013-01-28    2013-01-28    0
  • 1
    You may want to look into `foverlaps` or search "non-equi joins" from `data.table` – MichaelChirico Oct 09 '17 at 08:07
  • 1
    To point out one solution you can visit this [post](https://stackoverflow.com/questions/45916786/combining-overlapping-dates-by-id-based-on-a-condition/45919340#45919340). The last function (in EDIT2) aggregates into time continuums based on a 14-day rule. By erasing the "-14", you should arrive at the desired outcome excluding point 3 of your question – Patrik_P Oct 09 '17 at 08:50

1 Answers1

0

This helpful approach, originally posted here by @David Arenburg, worked fine (used all cases) after I removed the arrange() statement from the sequence of operations:

 data %>%
      group_by(id) %>%
      mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                                  cummax(as.numeric(end)))[-n()])) %>%
      group_by(id, indx) %>%
      summarise(start = first(start), end = last(end), surgdays = max(surg))

I also found this approach helpful for capturing other variables in the collapsed cases, such as admission diagnosis at earliest hospital visit. Just add to the summarise() statement:

summarise(start = first(start), end = last(end), admit_diagnosis = first(diagnosis), surgdays = max(surg))