
Problem: I have records with a start and end date for an intervention and I want to merge the rows according to the following rule:

For each ID, any intervention that begins within one year of the last intervention ending, merge the rows so that the start_date is the earliest start date of the two rows, and the end_date is the latest end_date of the two rows. I also want to keep track of intervention IDs if they are merged.

There can be five scenarios:

  1. Two rows have the same start date, but different end dates.

Start date....End date

Start date.........End date

  2. The period between Row 2's start and end date lies within the period of Row 1's start and end date.

Start date...................End date

.......Start date...End date

  3. Row 2's intervention starts within Row 1's intervention period but ends later.

Start date.....End date

.....Start date.............End date

  4. Row 2 starts within one year of the end of Row 1.

Start date....End date

......................|....<= 1 year....|Start date...End date

  5. Row 2 starts over one year after the end of Row 1.

Start date...End date

.....................|........ > 1 year..........|Start date...End date

I want to merge rows in cases 1 to 4 but not 5.

Data:

library(data.table)
sample_data <- data.table(id = c(rep(11, 3), rep(21, 4)),
    start_date = as.Date(c("2013-01-01", "2013-01-01", "2013-02-01", "2013-01-01", "2013-02-01", "2013-12-01", "2015-06-01")),
    end_date = as.Date(c("2013-06-01", "2013-07-01", "2013-05-01", "2013-07-01", "2013-09-01", "2014-01-01", "2015-12-01")),
    intervention_id = as.character(1:7),
    all_ids = as.character(1:7))

> sample_data
   id start_date   end_date intervention_id all_ids
1: 11 2013-01-01 2013-06-01               1       1
2: 11 2013-01-01 2013-07-01               2       2
3: 11 2013-02-01 2013-05-01               3       3
4: 21 2013-01-01 2013-07-01               4       4
5: 21 2013-02-01 2013-09-01               5       5
6: 21 2013-12-01 2014-01-01               6       6
7: 21 2015-06-01 2015-12-01               7       7

The final result should look like:

> merged_data
   id start_date   end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01               1 1, 2, 3
2: 21 2013-01-01 2014-01-01               4 4, 5, 6
3: 21 2015-06-01 2015-12-01               7       7

I'm not sure if the all_ids column is the best way to keep track of the intervention_ids, so I'm open to ideas for that. (The intervention_ids don't need to be in order in the all_ids column.)

It doesn't matter what the value of the intervention_id column is where rows have been merged.

What I tried:

I started off by writing a function to deal with only those cases where the start date is the same. It's a very non-R, non-data.table way of doing it and therefore very inefficient.

mergestart <- function(unmerged) {
  n <- nrow(unmerged)
  mini_merged <- data.table(id = double(n),
                     start_date = as.Date(NA),
                     end_date = as.Date(NA),
                     intervention_id = character(n),
                     all_ids = character(n))

  merge_a <- function(unmerged, un_i, merged, m_i, no_of_records) {
    merged[m_i] <- unmerged[un_i]
    un_i <- un_i + 1

    while (un_i <= no_of_records) {
      if(merged[m_i]$start_date == unmerged[un_i]$start_date) {
        merged[m_i]$end_date <- max(merged[m_i]$end_date, unmerged[un_i]$end_date)
        merged[m_i]$all_ids <- paste0(merged[m_i]$all_ids, ",", unmerged[un_i]$intervention_id)
        un_i <- un_i + 1
      } else {
        m_i <- m_i + 1
        merged[m_i] <- unmerged[un_i]
        un_i <- un_i + 1

        merge_a(unmerged, un_i, merged, m_i, (no_of_records - un_i))
      }
    }
    return(merged)  
  }

  mini_merged <- merge_a(unmerged, 1, mini_merged, 1, n)
  return(copy(mini_merged[id != 0]))
}

Using this function on just one id gives:

> mergestart(sample_data[id == 11])
   id start_date   end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01               1     1,2
2: 11 2013-02-01 2013-05-01               3       3

To use the function on the whole dataset:

n <- nrow(sample_data)
all_merged <- data.table(id = double(n),
                     start_date = as.Date(NA),
                     end_date = as.Date(NA),
                     intervention_id = character(n),
                     all_ids = character(n))

start_i <- 1
for (i in unique(sample_data$id)) {
  id_merged <- mergestart(sample_data[id == i])

  end_i <- start_i + nrow(id_merged) - 1
  all_merged[start_i:end_i] <- copy(id_merged)
  start_i <- end_i
}
all_merged <- all_merged[id != 0]

> all_merged
   id start_date   end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01               1     1,2
2: 21 2013-01-01 2013-07-01               4       4
3: 21 2013-02-01 2013-09-01               5       5
4: 21 2013-12-01 2014-01-01               6       6
5: 21 2015-06-01 2015-12-01               7       7

I also had a look at rolling joins but still can't work out how to use them in this situation.

This answer https://stackoverflow.com/a/48747399/6170115 looked promising but I don't know how to integrate all the other conditions and track the intervention IDs with this method.

Can anyone point me in the right direction?

happyspace
  • I wrote a function to tackle a similar problem [that's on Github](https://github.com/TelethonKids/biometrics/blob/master/R/group_interval.R) that groups hospital admission records into episodes of care. You could use this, group by episode and treatment ids and summarise to find min and max dates. This function isn't computationally efficient and may take a while on large data sets. – Paul Oct 01 '18 at 07:33
  • This may get you started: [Consolidate rows based on date ranges](https://stackoverflow.com/questions/39046210/consolidate-rows-based-on-date-ranges?noredirect=1&lq=1). – Henrik Oct 01 '18 at 08:15

1 Answer


There are related questions How to flatten / merge overlapping time periods and Consolidate rows based on date ranges, but neither of them covers the additional requirements posed by the OP.

library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := {
  tmp <- as.integer(start_date)
  cumsum(tmp > shift(cummax(tmp + 365L), type = "lag", fill = 0L))
}, by = id][]
   id start_date   end_date intervention_id all_ids period
1: 11 2013-01-01 2013-06-01               1       1      1
2: 11 2013-01-01 2013-07-01               2       2      1
3: 11 2013-02-01 2013-05-01               3       3      1
4: 21 2013-01-01 2013-07-01               4       4      1
5: 21 2013-02-01 2013-09-01               5       5      1
6: 21 2013-12-01 2014-01-01               6       6      1
7: 21 2015-06-01 2015-12-01               7       7      2

For the sake of simplicity, it is assumed that one year has 365 days, which ignores leap years with 366 days. If leap years are to be considered, more sophisticated date arithmetic is required.
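For example, lubridate's `%m+%` operator adds whole calendar years (it accounts for leap days and does not roll over month ends); benchmark version v_3 below uses the same idea on end_date. A minimal sketch of that variant of the period computation, assuming lubridate is available:

library(lubridate)
# leap-year-aware variant: shift each date by one calendar year instead of
# adding 365 days, then coerce to integer for cummax()
sample_data[, period := {
  tmp <- as.integer(start_date)
  lim <- as.integer(start_date %m+% years(1))  # start_date plus one calendar year
  cumsum(tmp > shift(cummax(lim), type = "lag", fill = 0L))
}, by = id][]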

Unfortunately, cummax() has no method for arguments of class Date or IDate (data.table's integer-based version of Date). Therefore, the coercion from Date to integer is required.
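As a quick illustration of that workaround (throwaway vector d, not part of the solution): coerce to integer, apply cummax(), and convert back if the dates are needed again.

d <- as.Date(c("2013-06-01", "2013-07-01", "2013-05-01"))
cummax(as.integer(d))                                 # 15857 15887 15887 (days since 1970-01-01)
as.Date(cummax(as.integer(d)), origin = "1970-01-01")
# "2013-06-01" "2013-07-01" "2013-07-01"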

# aggregate
sample_data[, .(start_date = start_date[1L], 
                end_date = max(end_date), 
                intervention_id = intervention_id[1L], 
                all_ids = toString(intervention_id)), 
            by = .(id, period)]
   id period start_date   end_date intervention_id all_ids
1: 11      1 2013-01-01 2013-07-01               1 1, 2, 3
2: 21      1 2013-01-01 2014-01-01               4 4, 5, 6
3: 21      2 2015-06-01 2015-12-01               7       7
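Regarding the OP's question whether a comma-separated string is the best way to keep track of the merged intervention ids: an alternative, if downstream code can handle it, is to store them as a list column instead of collapsing with toString(). Only the all_ids definition in the aggregation changes, e.g.:

# alternative: keep the merged intervention ids as a list column
sample_data[, .(start_date = start_date[1L], 
                end_date = max(end_date), 
                intervention_id = intervention_id[1L], 
                all_ids = list(intervention_id)), 
            by = .(id, period)]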

Edit: Correction

I just noticed that I had misinterpreted the OP's requirements. The OP requested (emphasis mine):

For each ID, any intervention that begins _within one year of the last intervention ending_, merge the rows so that the start_date is the earliest start date of the two rows, and the end_date is the latest end_date of the two rows.

The solution above looks for gaps of one year in the sequence of start_date values but not between each start_date and the preceding end_date, as requested. The corrected version is:

library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := cumsum(
  as.integer(start_date) > shift(
    cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
  , by = id][]
# aggregate
sample_data[, .(start_date = start_date[1L], 
                end_date = max(end_date), 
                intervention_id = intervention_id[1L], 
                all_ids = toString(intervention_id)), 
            by = .(id, period)]
   id period start_date   end_date intervention_id all_ids
1: 11      1 2013-01-01 2013-07-01               1 1, 2, 3
2: 21      1 2013-01-01 2014-01-01               4 4, 5, 6
3: 21      2 2015-06-01 2015-12-01               7       7

The result for the given sample dataset is identical for both versions, which is why the error slipped through unnoticed.
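To make the period computation easier to follow, here is an illustrative breakdown of the intermediate steps for id == 21 (the objects x, lim, prev_lim are throwaway names used only for exposition):

x <- sample_data[id == 21]
# one year (365-day approximation) after each end_date, as integer days
lim <- cummax(as.integer(x$end_date) + 365L)
# latest "end + 1 year" seen before the current row (0 for the first row)
prev_lim <- shift(lim, type = "lag", fill = 0L)
# TRUE where a row starts more than one year after every earlier
# intervention has ended, i.e. where a new period begins
new_period <- as.integer(x$start_date) > prev_lim    # TRUE FALSE FALSE TRUE
cumsum(new_period)                                   # 1 1 1 2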

Benchmark

The OP has mentioned in a comment that using lubridate's date arithmetic dramatically increased run times.

According to my benchmark below, the penalty for using end_date %m+% years(1) is not that large. I have benchmarked three versions of the code:

  • v_1 is the corrected version from above.
  • v_2 pulls the type conversion and the date arithmetic out of the grouping part and creates two helper columns in advance.
  • v_3 is like v_2 but uses end_date %m+% years(1).

The benchmark is repeated for different problem sizes, i.e., total numbers of rows. The number of distinct ids is also varied, as grouping may have an effect on performance. According to the OP, their full dataset of 500 k rows has 250 k unique ids, which corresponds to an id_share of 0.5 (50%). The benchmark simulates id_shares of 0.5, 0.2, and 0.01 (50%, 20%, and 1%).

As sample_data is modified, each run starts with a fresh copy.

library(bench)
library(magrittr)
library(lubridate)   # for %m+% and years() used in v_3
bm <- press(
  id_share = c(0.5, 0.2, 0.01),
  n_row = c(1000L, 10000L, 1e5L),
  {
    n_id <- max(1L, as.integer(n_row * id_share))
    print(sprintf("Number of ids: %i", n_id))
    set.seed(123L)
    sample_data_0 <- lapply(seq(n_id), function(.id) data.table(
      start_date = as.IDate("2000-01-01") + cumsum(sample(0:730, n_row / n_id, TRUE))
    )) %>% 
      rbindlist(idcol = "id") %>% 
      .[, end_date := start_date + sample(30:360, n_row, TRUE)] %>% 
      .[, intervention_id := as.character(.I)]
    mark(
      v_1 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, period := cumsum(
          as.integer(start_date) > shift(
            cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
          , by = id]
        sample_data[, .(start_date = start_date[1L], 
                        end_date = max(end_date), 
                        intervention_id = intervention_id[1L], 
                        all_ids = toString(intervention_id)), 
                    by = .(id, period)]
      },
      v_2 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, `:=`(start = as.integer(start_date), 
                           end = as.integer(end_date) + 365)]
        sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
                    , by = id]
        sample_data[, .(start_date = start_date[1L], 
                        end_date = max(end_date), 
                        intervention_id = intervention_id[1L], 
                        all_ids = toString(intervention_id)), 
                    by = .(id, period)]
        },
      v_3 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, `:=`(start = as.integer(start_date), 
                           end = as.integer(end_date %m+% years(1)))]
        sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
                    , by = id]
        sample_data[, .(start_date = start_date[1L], 
                        end_date = max(end_date), 
                        intervention_id = intervention_id[1L], 
                        all_ids = toString(intervention_id)), 
                    by = .(id, period)]
      },
      check = FALSE,
      min_iterations = 3
    )
  }
)

ggplot2::autoplot(bm)

[benchmark plot: run times of v_1, v_2, and v_3 for each combination of n_row and id_share]

The result shows that the number of groups, i.e., the number of unique ids, has a stronger effect on run time than the differences between the code versions. With many groups, creating the helper columns before grouping (v_2) improves performance.

Uwe
  • Re as.integer, I guess they should just `:=` overwrite with IDate before starting, eh. – Frank Oct 01 '18 at 15:23
  • @Frank `cummax()` works with neither `Date` nor `IDate`. – Uwe Oct 01 '18 at 22:00
  • Amazing, thank you @Uwe. I'm going to see if I can use `lubridate`'s `%m+%` operator somewhere in the arithmetic, then convert to integer for the `cummax`. Just so you know, your whole procedure took about 20 seconds for 500k rows! – happyspace Oct 01 '18 at 23:44
  • Using `lubridate` to add a year before converting to integer `tmp_yr <- as.integer(start_date %m+% years(1))`, extends the run time from about 20 seconds to over 15 mins. – happyspace Oct 02 '18 at 00:36
  • @Uwe What is the purpose of `cummax()`? Isn't it operating on individual elements of the vector? – happyspace Oct 02 '18 at 06:03
  • `cummax()` is the _cumulative maximum_. `cummax(x)` returns a vector of the same length as `x` which contains the largest value encountered _so far_. So, `cummax(c(1, 0, 3))` returns `[1] 1 1 3`. – Uwe Oct 02 '18 at 08:24
  • Using helper columns and `%m+%`, the run time was under 20 seconds on my full dataset of ~500k rows. Love your work @Uwe. – happyspace Oct 02 '18 at 23:25
  • Great! How many different `id` does your full dataset have? I am asking because the number of groups seems to have strong impact on run time. – Uwe Oct 03 '18 at 03:13
  • There are about 250k unique `id`s. So I think you are right about the groups. – happyspace Oct 03 '18 at 05:40
  • I have re-run the benchmark to include the case of an `id_share` of 0.5 (50%) according to your 250 k groups / 500 k rows. – Uwe Oct 03 '18 at 09:31