21

I have a large data set of time periods, defined by a 'start' and an 'end' column. Some of the periods overlap.

I would like to combine (flatten / merge / collapse) all overlapping time periods to have one 'start' value and one 'end' value.

Some example data:

  ID      start        end
1  A 2013-01-01 2013-01-05
2  A 2013-01-01 2013-01-05
3  A 2013-01-02 2013-01-03
4  A 2013-01-04 2013-01-06
5  A 2013-01-07 2013-01-09
6  A 2013-01-08 2013-01-11
7  A 2013-01-12 2013-01-15

Desired result:

  ID      start        end
1  A 2013-01-01 2013-01-06
2  A 2013-01-07 2013-01-11
3  A 2013-01-12 2013-01-15

What I have tried:

  require(dplyr)
  data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"), 
    start = structure(c(1356998400, 1356998400, 1357084800, 1357257600, 
    1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200, 
    1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct", 
    "POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA, 
-7L), class = "data.frame")

remove.overlaps <- function(data) {
  data2 <- data
  for (i in 1:length(unique(data$start))) {
    x3 <- filter(data2, start >= data$start[i] & start <= data$end[i])
    x4 <- x3[1, ]
    x4$end <- max(x3$end)
    data2 <- filter(data2, start < data$start[i] | start > data$end[i])
    data2 <- rbind(data2, x4)
  }
  data2 <- na.omit(data2)
}

data <- remove.overlaps(data)
Jonno Bourne

6 Answers

22

Here's a possible solution. The basic idea is to compare the lagged start date with the maximum end date "until now" (using the cummax function) and to create an index that separates the data into groups:

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))

# Source: local data frame [3 x 4]
# Groups: ID
# 
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15
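
To make the index construction easier to follow, here is a sketch of the intermediate columns (the values shown are for the example data, which is already sorted by start):

data %>%
  arrange(ID, start) %>%
  group_by(ID) %>%
  mutate(
    new_group = as.numeric(lead(start)) > cummax(as.numeric(end)), # TRUE where the next row starts after everything seen so far has ended
    indx      = c(0, cumsum(new_group)[-n()])                      # shift down one row and 0-base the group counter
  )
# new_group: FALSE FALSE FALSE  TRUE FALSE  TRUE    NA
# indx:          0     0     0     0     1     1     2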
David Arenburg
  • Thanks for such a great answer! One question, though: when I used this on the real data set, the dates ended up in numeric (seconds) format and I had to wrap the summarise variables in as.POSIXct() to convert them back. Any ideas why? – Jonno Bourne Mar 10 '15 at 15:01
  • Not sure what you mean by that... When I save the result in some variable, both `start` and `end` are of class `POSIXct`... – David Arenburg Mar 10 '15 at 15:09
  • 2
    Btw, if you use multiple IDs you have to sort with arrange(data, ID, start), as lag is not affected by grouping and so may take dates from outside the ID group, messing up the final structure. This wasn't part of the question, but I found out the hard way afterwards. – Jonno Bourne May 22 '15 at 13:35
  • What does the `[-n()]` do? I was able to adapt this to my own needs (similar situation but with an allowance of 90 days between dates still counting as "overlapping") but I had to copy the `[-n()]` verbatim without really understanding what it does. – Dannid Feb 01 '19 at 23:34
  • 1
    Aha! I figured it out. (it's removing the last item in the `cumsum` to accommodate the added `0` at the beginning of the vector.) – Dannid Feb 01 '19 at 23:43
  • @DavidArenburg I think your solution makes the assumption that the data is ordered by the start date. If you change the order of the rows, the result might also change. I'd therefore suggest adding arrange(ID, start) after the grouping is done. – peer May 02 '19 at 12:52
  • @peer, yes, this was already mentioned in the comments above and in the other answers. – David Arenburg May 02 '19 at 20:16
  • Following up on @Dannid 's comment, how can one adapt this code to allow touching or very close intervals to be merged, i.e. let's say <5-day gap? – Fred-LM Aug 28 '19 at 15:10
  • @Dannid Can you explain a bit more? I know n() as a row count and [] for indexing, but the cumsum is an integer. So if we're grouping by ID and there are 7 rows, what does it mean to get the -7th index of an integer within a vector? How does c(0, 2[-7]) return 1? – Nicholas Hassan Aug 11 '22 at 18:25
  • 1
    @NicholasHassan The cumsum output is a vector of cumulative sums, one per entry. In this case we're taking the cumulative sum of the question: is lead(start) later than the maximum end seen so far for this group? Since this is grouped by ID, the sum (that is, the number of times lead(start) for the group exceeds the running maximum end) will be some integer. I'm not sure it's guaranteed that the output is a vector of ever-increasing values, but that happened in the example (the data is first sorted by start). To create a 0-based index, we prepend a 0 and then remove the last element with [-n()]. – Dannid Aug 25 '22 at 17:40
  • Replacing first and last by min and max works as stated in the next post. This code gives an error in some cases (explained in the next post also). – gastan Aug 25 '23 at 10:24
17

@David Arenburg's answer is great, but I ran into an issue where an earlier interval ended after a later interval, and using last() in the summarise call then resulted in the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end):

data %>%
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))

Also, as @Jonno Bourne mentioned, sorting by start and any grouping variables is important before applying the method.
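
For example, a sketch of the same pipeline with the sort made explicit:

data %>%
  arrange(ID, start) %>% # sort within each ID so lead()/cummax() see the rows in order
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))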

zack
  • Can you explain what [-n()] means? I know n() as a row count and [] for indexing, but the cumsum is an integer. So if we're grouping by ID and there are 7 rows, what does it mean to get the -7th index of an integer within a vector? How does c(0, 2[-7]) return 1? – Nicholas Hassan Aug 11 '22 at 18:32
  • From ?`[`: "For [-indexing only: i, j, ... can be logical vectors, indicating elements/slices to select. Such vectors are recycled if necessary to match the corresponding extent. i, j, ... can also be negative integers, indicating elements/slices to leave out of the selection." Thus `[-n()]` leaves out the last element of the group. – zack Aug 12 '22 at 13:09
6

For the sake of completeness, the IRanges package on Bioconductor has some neat functions for dealing with date or datetime ranges. One of them is the reduce() function, which merges overlapping or adjacent ranges.

However, there is a drawback: IRanges works on integer ranges (hence the name), so the convenience of its functions comes at the cost of converting Date or POSIXct objects back and forth.

Also, it seems that dplyr doesn't play well with IRanges (at least judging by my limited experience with dplyr), so I use data.table:

library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)

setDT(data)[, {
  ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
  .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
       ID      start        end
   <fctr>     <POSc>     <POSc>
1:      A 2013-01-01 2013-01-06
2:      A 2013-01-07 2013-01-11
3:      A 2013-01-12 2013-01-15

A variant of the code is

setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
  , lapply(.SD, as_datetime), .SDcols = -"width"], 
  by = ID]

In both variants, as_datetime() from the lubridate package is used, which spares us from specifying the origin when converting numbers to POSIXct objects.
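
For comparison, a sketch of the first variant without lubridate, where the origin has to be spelled out (and the UTC time zone given explicitly to match) when converting the numeric positions back to POSIXct:

setDT(data)[, {
  ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
  .(start = as.POSIXct(start(ir), origin = "1970-01-01", tz = "UTC"),
    end   = as.POSIXct(end(ir),   origin = "1970-01-01", tz = "UTC"))
}, by = ID]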

It would be interesting to see a benchmark comparison of the IRanges approaches vs. David's answer.

Uwe
  • Besides collapsing rows that have overlapping intervals, how can we also take the minimum of another column? E.g. with `data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L), .Label = "A", class = "factor"), start = structure(c(15706, 15706, 15707, 15709), class = "Date"), end = structure(c(15710, 15710, 15708, 15711), class = "Date"), value = c(3L, 7L, 8L, 5L)), class = "data.frame", row.names = c(NA, -4L))` the merged `value` column should give `3`. – HNSKD Jun 03 '20 at 07:21
  • @HNSKD, this should be posted as a separate question with its own [mcve], please. But a quick answer is: `library(data.table); setDT(data)[order(start, end), grp := cumsum(cummax(shift(as.numeric(end), fill = 0)) < as.numeric(start))][, .(start = min(start), end = max(end), value = min(value)), by = grp]` – Uwe Jun 03 '20 at 07:41
3

I think that you can solve this problem pretty nicely with dplyr and the ivs package, which is designed for working with interval vectors, exactly like what you have here. It is inspired by IRanges, but is more suitable for use in the tidyverse and is completely generic so it can handle date intervals automatically (no need to convert to numeric and back).

The key is to combine the start/end boundaries into a single interval vector column, and then use iv_groups(). This merges all of the overlapping intervals in the interval vector and returns the intervals that remain after the overlaps have been merged.

It seems like you want to do this by ID, so I've also grouped by ID.

library(ivs)
library(dplyr)

data <- tribble(
  ~ID,       ~start,         ~end,
  "A", "2013-01-01", "2013-01-05",
  "A", "2013-01-01", "2013-01-05",
  "A", "2013-01-02", "2013-01-03",
  "A", "2013-01-04", "2013-01-06",
  "A", "2013-01-07", "2013-01-09",
  "A", "2013-01-08", "2013-01-11",
  "A", "2013-01-12", "2013-01-15"
) %>%
  mutate(
    start = as.Date(start),
    end = as.Date(end)
  )

data
#> # A tibble: 7 × 3
#>   ID    start      end       
#>   <chr> <date>     <date>    
#> 1 A     2013-01-01 2013-01-05
#> 2 A     2013-01-01 2013-01-05
#> 3 A     2013-01-02 2013-01-03
#> 4 A     2013-01-04 2013-01-06
#> 5 A     2013-01-07 2013-01-09
#> 6 A     2013-01-08 2013-01-11
#> 7 A     2013-01-12 2013-01-15

# Combine `start` and `end` into a single interval vector column
data <- data %>%
  mutate(interval = iv(start, end), .keep = "unused")

# Note that this is a half-open interval!
data  
#> # A tibble: 7 × 2
#>   ID                    interval
#>   <chr>               <iv<date>>
#> 1 A     [2013-01-01, 2013-01-05)
#> 2 A     [2013-01-01, 2013-01-05)
#> 3 A     [2013-01-02, 2013-01-03)
#> 4 A     [2013-01-04, 2013-01-06)
#> 5 A     [2013-01-07, 2013-01-09)
#> 6 A     [2013-01-08, 2013-01-11)
#> 7 A     [2013-01-12, 2013-01-15)

# It seems like you'd want to group by ID, so let's do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
  group_by(ID) %>%
  summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#>   ID                    interval
#>   <chr>               <iv<date>>
#> 1 A     [2013-01-01, 2013-01-06)
#> 2 A     [2013-01-07, 2013-01-11)
#> 3 A     [2013-01-12, 2013-01-15)

Created on 2022-04-05 by the reprex package (v2.0.1)
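
If separate start and end columns are needed again (as in the desired result), the interval column can be unpacked with the iv_start() and iv_end() accessors from ivs, e.g.:

data %>%
  group_by(ID) %>%
  summarise(interval = iv_groups(interval), .groups = "drop") %>%
  mutate(start = iv_start(interval), end = iv_end(interval)) %>%
  select(-interval)

Because the intervals were built directly from the original start and end columns, this reproduces the dates of the desired result.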

Davis Vaughan
2

It looks like I'm a little late to the party, but I took @zach's code and rewrote it using data.table below. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRanges method because that package is not yet available for R 3.5.1.)

Also, fwiw, the accepted answer doesn't capture the edge case in which one date range is totally within another (e.g., 2018-07-07 to 2018-07-14 is within 2018-05-01 to 2018-12-01). @zach's answer does capture that edge case.

library(data.table)
library(dplyr) # for lead(), used in the index calculation below

start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")

# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]

# set keys (also orders)
setkey(d2, ID, start_col, end_col)

# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col, 
                  END_DT = end_col, 
                  indx = c(0, cumsum(as.numeric(lead(start_col)) > cummax(as.numeric(end_col)))[-.N])),
               keyby=ID
               ][,.(start=min(START_DT), 
                    end = max(END_DT)),
                 by=c("ID","indx")
                 ]
enmyj
1

Benchmarks along with an even faster data.table solution

First, I echo @enmyj and @zach that the solution in the accepted answer gives erroneous results when there is one range that is completely inside another.

A faster approach that is reminiscent of the one proposed in the accepted answer:

  1. Sort by ID and then all dates (start and end combined).
  2. Subtract the cumulative count of end dates from the cumulative count of start dates; this gives a running count of how many date ranges are currently open.
  3. Find the indices where this sum is 0. The dates on these rows are the end dates of each collection of overlapping date ranges. The dates on the next row are the start dates of the next collection of overlapping date ranges. The indices can also be used to easily perform roll-up calculations of other columns.

This involves just a few vectorized calls and no grouping operations, so it is very performant.

As a function:

flatten <- function(dt) {
  # stack start (+1) and end (-1) events into one long table, then order by ID and date
  setorder(dt[, rbindlist(.(.(ID, start, 1L), .(ID, end, -1L)))], V1, V2)[
    , .(
      ID = V1[i <- which(!cumsum(V3))],      # rows where the running count of open ranges hits 0
      start = V2[c(1L, i[-length(i)] + 1L)], # first date, then the date following each closing row
      end = V2[i]                            # the closing dates themselves
    )
  ]
}
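
A quick sanity check, assuming the question's example data has been converted to a data.table:

library(data.table)
flatten(as.data.table(data))
# should collapse to the three ranges of the desired result:
#   2013-01-01 to 2013-01-06, 2013-01-07 to 2013-01-11, 2013-01-12 to 2013-01-15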

Benchmarks

The benchmarking uses a large-ish data.table.

library(data.table)
library(dplyr)
library(ivs)

data <- data.table(
  ID = sample(1e3, 1e5, 1),
  start = as.Date(sample(1e4:2e4, 1e5, 1), origin = "1970-01-01")
)[, end := start + sample(100)]

fCum <- function(dt) {
  # adapted from https://stackoverflow.com/a/47337684/9463489
  dt %>%
    arrange(ID, start) %>%
    group_by(ID) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                                cummax(as.numeric(end)))[-n()])) %>%
    group_by(ID, indx) %>%
    reframe(start = min(start), end = max(end)) %>%
    select(-indx)
}

fivs <- function(dt) {
  # adapted from https://stackoverflow.com/a/71754454/9463489
  dt %>%
    mutate(interval = iv(start, end), .keep = "unused") %>%
    group_by(ID) %>%
    reframe(interval = iv_groups(interval)) %>%
    mutate(start = iv_start(interval), end = iv_end(interval)) %>%
    select(-interval)
}

squish <- function(dt) {
  # adapted from https://stackoverflow.com/a/53890653/9463489
  setkey(dt, ID, start, end)
  dt[,.(START_DT = start, 
        END_DT = end, 
        indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-.N])),
     keyby=ID
  ][,.(start=min(START_DT), 
       end = max(END_DT)),
    by=c("ID","indx")
  ][, indx := NULL]
}

Timings:

microbenchmark::microbenchmark(
  flatten = flatten(dt),
  fCum = setDT(fCum(dt)),
  fivs = setDT(fivs(dt)),
  squish = squish(dt),
  times = 10,
  check = "equal",
  setup = {dt <- copy(data)}
)
#> Unit: milliseconds
#>     expr       min        lq       mean     median        uq       max neval
#>  flatten   11.4732   11.8141   13.86760   12.36580   15.9228   19.1775    10
#>     fCum 1827.1197 1876.7701 1898.24285 1908.88640 1926.6548 1939.2919    10
#>     fivs  160.2568  163.9617  173.31783  173.32095  177.3789  192.7755    10
#>   squish   62.5197   64.9126   66.26047   65.08515   67.1685   70.9916    10

Aggregating other columns

The approach used by flatten also makes it easy to aggregate other columns in the data.table.

data[, v := runif(1e5)]

setorder(data[, rbindlist(.(.(ID, start, 1L, 0), .(ID, end, -1L, v)))], V1, V2)[
  , .(
    ID = V1[i <- which(!cumsum(V3))],
    start = V2[c(1L, i[-length(i)] + 1L)],
    end = V2[i],
    v = diff(c(0, cumsum(V4)[i]))
  )
]
#>          ID      start        end          v
#>     1:    1 1997-09-25 1997-09-27 0.40898255
#>     2:    1 1997-11-09 1997-11-30 0.44067634
#>     3:    1 1998-04-27 1998-07-17 1.73142460
#>     4:    1 1999-08-05 1999-11-05 0.41103832
#>     5:    1 1999-12-09 2000-01-26 0.90639735
#>    ---                                      
#> 60286: 1000 2023-01-06 2023-03-28 0.54727106
#> 60287: 1000 2023-07-20 2023-10-16 1.74270130
#> 60288: 1000 2024-03-24 2024-06-23 0.07110824
#> 60289: 1000 2024-07-13 2024-07-31 0.63888263
#> 60290: 1000 2024-10-02 2024-10-19 0.22872167
jblood94