I have a dataset of individuals (CSN), each of whom has had anywhere from zero to multiple interventions performed during a hospital admission (in this case, central lines placed), each with a start and an end date. I am trying to generate a new date range that will calculate any overlapping dates. In other words, I'm trying to calculate the total date range when an individual had a central line in place.
Data for example:
structure(list(CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), First_day = structure(c(1514937600,
1514937600, 1515024000, 1515024000, 1515110400, 1515974400, 1516147200,
1516147200, 1516147200, 1516233600, 1516233600, 1517097600, 1517097600,
1517702400, 1517356800, 1518220800, 1519257600, 1519948800, 1520812800,
1521504000, 1522022400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Last_day = structure(c(1515628800, 1515110400, 1515542400,
1515542400, 1515628800, 1516579200, 1516320000, 1517184000, 1516233600,
1517184000, 1517702400, 1517184000, 1517616000, 1517702400, 1518220800,
1518825600, 1519689600, 1520812800, 1521763200, 1522108800, 1522108800
), tzone = "UTC", class = c("POSIXct", "POSIXt"))), row.names = c(NA,
-21L), class = c("tbl_df", "tbl", "data.frame"))
Ideally, the output would return a single date range for all overlapping dates, but if there were a stretch of days that are missed by each, then a new interval would be created. So, for group 1, rows 1-5 would all have start = 2018-01-03 and end = 2018-01-11, but then row 6 would have start = 2018-01-15 and end = 2018-01-22.
I've tried to do the following:
df %>%
arrange(CSN_id, First_day) %>%
mutate(First_day = ymd(First_day),
Last_day = ymd(Last_day),
start = ymd("1970-01-01"),
end = ymd("1970-01-01")) %>%
group_by(CSN_id) %>%
rowwise() %>%
mutate(test = if_else(row_number() == 1, interval(First_day, Last_day), interval(lag(start), lag(end))),
start = if_else(row_number() == 1, First_day,
if_else(First_day <= lag(end), lag(First_day), First_day)),
end = if_else(row_number() == 1, Last_day,
if_else(Last_day %within% lag(test) == TRUE, lag(end), Last_day)))
However, I don't think the lag function is working as intended, and it always returns Last_day for some reason. I tried getting rid of rowwise, but then the intervals get messed up (persistently stuck in 1970s).
The output I'm getting is:
structure(list(CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), First_day = structure(c(17534,
17534, 17535, 17535, 17536, 17546, 17548, 17548, 17548, 17549,
17549, 17559, 17559, 17566, 17562, 17572, 17584, 17592, 17602,
17610, 17616), class = "Date"), Last_day = structure(c(17542,
17536, 17541, 17541, 17542, 17553, 17550, 17560, 17549, 17560,
17566, 17560, 17565, 17566, 17572, 17579, 17589, 17602, 17613,
17617, 17617), class = "Date"), start = structure(c(17534, 17534,
17535, 17535, 17536, 17546, 17548, 17548, 17548, 17549, 17549,
17559, 17559, 17566, 17562, 17572, 17584, 17592, 17602, 17610,
17616), class = "Date"), end = structure(c(17542, 17536, 17541,
17541, 17542, 17553, 17550, 17560, 17549, 17560, 17566, 17560,
17565, 17566, 17572, 17579, 17589, 17602, 17613, 17617, 17617
), class = "Date"), test = new("Interval", .Data = c(691200,
172800, 518400, 518400, 518400, 604800, 172800, 1036800, 86400,
950400, 1468800, 86400, 518400, 0, 864000, 604800, 432000, 864000,
950400, 604800, 86400), start = structure(c(1514937600, 1514937600,
1515024000, 1515024000, 1515110400, 1515974400, 1516147200, 1516147200,
1516147200, 1516233600, 1516233600, 1517097600, 1517097600, 1517702400,
1517356800, 1518220800, 1519257600, 1519948800, 1520812800, 1521504000,
1522022400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
tzone = "UTC")), class = c("rowwise_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -21L), groups = structure(list(
CSN_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .rows = structure(list(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame")))
Is there something obvious I'm missing? Any help would be much appreciated!