0

I have a dataframe that looks like this:

ID    Group    Start Date    End Date
1     A        2018-08-08    2018-08-09
1     A        2018-08-07    2018-08-08
1     A        2018-08-05    2018-08-07
1     B        2018-08-08    2018-08-09
1     B        2018-08-07    2018-08-08
2     A        2018-08-08    2018-08-09
2     A        2018-08-07    2018-08-08
2     A        2018-08-01    2018-08-07
2     B        2018-08-08    2018-08-09
2     B        2018-08-07    2018-08-08
3     B        2018-08-07    2018-08-08
4     B        2018-08-07    2018-08-08
4     B        2018-08-01    2018-08-07

And I'd like collapse it so that if the Start Date variable in one row matches the End Date variable in the next row, the rows are combined while grouping by the ID and Group variables:

ID    Group    Start Date    End Date
1     A        2018-08-05    2018-08-09
1     B        2018-08-07    2018-08-09
2     A        2018-08-01    2018-08-09
2     B        2018-08-07    2018-08-09
3     B        2018-08-07    2018-08-08
4     B        2018-08-01    2018-08-08

In dplyr, I know it's possible to do this through something similar to the following:

df %>% 
group_by(ID, Group) %>%
rowwise() %>%
do(somefunction(x){})

But I need help in writing the function. Or if there's a for loop or other implementation that could accomplish this, it would be much appreciated.

Edit I've modified the example data for more clarity.

hvgupta
  • 189
  • 12
  • relevant: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods-in-r/28938694#28938694 and https://stackoverflow.com/questions/52091907/identify-consecutively-overlapping-segments-in-r. I think your dataset is not minimal enough as it does not contain cases where rows are not consecutive date periods. – chinsoon12 Sep 21 '18 at 00:40
  • @chinsoon12 You're correct. And in fact, the solution here wouldn't distinguish non-consecutive date periods. Ultimately, the function in this answer worked best: https://stackoverflow.com/a/40718415/10292275. It preserves non-consecutive date ranges while also letting users adjust the amount of time between ranges to combine them. – hvgupta Sep 24 '18 at 20:17

1 Answers1

1

I've found a more flexible solution using the IRanges package from BioConducter. This works well for cases which are not necessarily consecutive to their neighbors:

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
library(data.table)
library(lubridate)

setDT(df)

df[, , as.data.table(reduce(
       IRanges(as.numeric(`Start Date`),
               as.numeric(`End Date`)),
               min.gapwidth=1L))[, lapply(.SD, as_date), .SDcols = -"width"], 
                                 .(`ID`, `Group`)]

In the snippet above, I'm using the reduce function from IRanges on each of the date ranges, which collapses adjacent ranges within a given tolerance, min.gapwidth. In this case, I've specified date ranges to be combined if they're within one day of each other (min.gapwidth=1L). You can set this to 0L to combine strictly consecutive dates.

IRanges only takes numeric values, so I convert the dates to numeric and then I apply the as_date function from the lubridate package to convert them back to dates. I've also removed the width column which would tell me how many days are within each range.

hvgupta
  • 189
  • 12
  • I like this solution! its a bit complicated for someone who doesn't have strong familiarity with data.table and bioconductor, but it works like a charm! One thing to note: i think you have an extra comma in the `df[, , as.data.table(`.... section. – Reilstein Oct 08 '20 at 02:50