3

I have a table with dateRanges and corresponding IDs. I want to group the IDs based on whether their start/end range overlaps with the date range for another ID. If a date range for an ID is partially or completely within that for another ID, they should belong to the same group. I want to add a column indicating this grouping, alongside the start/end date as given by the smallest and largest dates within the group.

The data:

"ID"    "start" "end"
1   2018-10-02  2019-01-15
2   2019-01-13  2019-02-01
3   2018-10-01  2018-11-01
4   2018-10-05  2018-10-06
5   2019-09-09  2019-10-08
6   2019-02-06  2019-04-07
7   2019-03-24  2019-04-17
8   2019-03-21  2019-04-14
9   2019-03-27  2019-04-16
10  2019-04-30  2019-05-08

The ideal result:

"ID"    "start" "end"   "group_ID"  "group_start"   "group_end"
1   2018-10-02  2019-01-15  1   2018-10-01  2019-02-01
2   2019-01-13  2019-02-01  1   2018-10-01  2019-02-01
3   2018-10-01  2018-11-01  1   2018-10-01  2019-02-01
4   2018-10-05  2018-10-06  1   2018-10-01  2019-02-01
5   2019-09-09  2019-10-08  2   2019-09-09  2019-10-08
6   2019-02-06  2019-04-07  3   2019-02-06  2019-05-08
7   2019-03-24  2019-04-17  3   2019-02-06  2019-05-08
8   2019-03-21  2019-04-14  3   2019-02-06  2019-05-08
9   2019-03-27  2019-04-16  3   2019-02-06  2019-05-08
10  2019-04-30  2019-05-08  3   2019-02-06  2019-05-08

What I've been thinking of that may work is creating a matrix of IDs (i.e.- rows and columns spanning from ID 1 to ID 10) and filling each cell on whether the date ranges for the given intersection of IDs overlap. Following this, binning then into groups and finding the min/max for the given group, but this seems really complicated. There must be an easier solution that does not involve looking at edges on a matrix to create clusters.

Edit- format for .csv:

ID,start,end
1,2018-10-02,2019-01-15
2,2019-01-13,2019-02-01
3,2018-10-01,2018-11-01
4,2018-10-05,2018-10-06
5,2019-09-09,2019-10-08
6,2019-02-06,2019-04-07
7,2019-03-24,2019-04-17
8,2019-03-21,2019-04-14
9,2019-03-27,2019-04-16
10,2019-04-30,2019-05-08
CorerMaximus
  • 653
  • 5
  • 15
  • How is group 2 created if the previous dates are in 2018 and the current row is in 2019 ? there are no overlapping matches. Same for group 3. Can you explain where is the overlap exactly ? – Ronak Shah Dec 06 '19 at 01:21
  • If two or more IDs have overlapping date ranges, they should be in the same group. Since group 2's start and end don't fall under any of the intervals for the groups 1 and 3, it is in its own group. – CorerMaximus Dec 06 '19 at 02:46
  • For ID6, there is no overlapping with other members of group 3, so should you get 4 groups instead of 3 ? Also, can you provide a reproducible example fo your initial dataframe and of the expected dataframe ? It will make things easier for people trying to help. Check this link to know how to do it: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – dc37 Dec 06 '19 at 03:52
  • That was an error on my end, just updated it! – CorerMaximus Dec 06 '19 at 05:06
  • 1
    please provide a reproducible example (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). As it is we can copy paste it easily in R and try to troubleshoot your issue. You can add the output of `dput(df[1:10,])` – dc37 Dec 06 '19 at 05:14
  • I've just attached a version with commas. I didn't realize stackoverflow converts tabs to spaces when posting, my apologies and thanks for the recommendation! – CorerMaximus Dec 06 '19 at 17:45

1 Answers1

0

Here is an option:

setorder(DT, start, end)
DT[order(start, end), g := cumsum(start > shift(cummax(as.integer(end)), fill=0L))][,
    c("gstart","gend") := .(min(start), max(end)), g]    

output:

    ID      start        end g     gstart       gend
 1:  1 2018-10-02 2019-01-15 1 2018-10-01 2019-02-01
 2:  2 2019-01-13 2019-02-01 1 2018-10-01 2019-02-01
 3:  3 2018-10-01 2018-11-01 1 2018-10-01 2019-02-01
 4:  4 2018-10-05 2018-10-06 1 2018-10-01 2019-02-01
 5:  5 2019-09-09 2019-10-08 4 2019-09-09 2019-10-08
 6:  6 2019-02-06 2019-04-07 2 2019-02-06 2019-04-17
 7:  7 2019-03-24 2019-04-17 2 2019-02-06 2019-04-17
 8:  8 2019-03-21 2019-04-14 2 2019-02-06 2019-04-17
 9:  9 2019-03-27 2019-04-16 2 2019-02-06 2019-04-17
10: 10 2019-04-30 2019-05-08 3 2019-04-30 2019-05-08

data:

library(data.table)
DT <- fread("ID,start,end
1,2018-10-02,2019-01-15
2,2019-01-13,2019-02-01
3,2018-10-01,2018-11-01
4,2018-10-05,2018-10-06
5,2019-09-09,2019-10-08
6,2019-02-06,2019-04-07
7,2019-03-24,2019-04-17
8,2019-03-21,2019-04-14
9,2019-03-27,2019-04-16
10,2019-04-30,2019-05-08")
cols <- c("start", "end")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]

Reference: How to flatten / merge overlapping time periods

chinsoon12
  • 25,005
  • 4
  • 25
  • 35