0

Following data.table

df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
                 start_date=c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24"),
                 end_date=c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24"),
                 variable1=c("a","c","c","d","a",NA,"a","a","b"))
df                 
id start_date   end_date variable1
1:  1 2019-05-08 2019-09-08         a
2:  2 2019-08-01 2019-12-01         c
3:  2 2019-07-12 2019-07-30         c
4:  2 2017-05-24 2017-11-24         d
5:  3 2016-05-08 2017-07-25         a
6:  3 2017-08-01 2018-08-01      <NA>
7:  4 2019-06-12 2019-12-12         a
8:  4 2017-02-24 2017-08-24         a
9:  4 2017-08-24 2018-08-24         b

Within the same ID, I want to compare the start_date and end_date. If the end_date of one row is within 30 days of the start_date of another row, I want to combine the rows. So that it looks like this:

id start_date   end_date variable1
1:  1 2019-05-08 2019-09-08         a
2:  2 2019-07-12 2019-12-01         c
3:  2 2017-05-24 2017-11-24         d
4:  3 2016-05-08 2018-08-01         a
5:  4 2019-06-12 2019-12-12         a
6:  4 2017-02-24 2017-08-24         a
7:  4 2017-08-24 2018-08-24         b

If the other variables of the rows are the same, rows should be combined with the earliest start_date and latest end_date as id number 2. If the variable1 is NA it should be replaced with values from the matching row as id number 3. If the variable1 has different values, rows should remain separate as id number 4. The data.table contains more variables and objects than displayed here. Preferable a function in data.table.

Svenja
  • 119
  • 5
  • What have you tried? Where are you stuck? – s_baldur Aug 14 '20 at 16:30
  • Does this answer your question? [How to compare two rows of dates in R and transform into one row](https://stackoverflow.com/questions/62871548/how-to-compare-two-rows-of-dates-in-r-and-transform-into-one-row) I answered a data.table + intervals solution in this question. – Wimpel Aug 14 '20 at 16:33
  • I am getting following error message `error in evaluating the argument 'x' in selecting a method for function 'close_intervals': error in evaluating the argument 'x' in selecting a method for function 'contract': error in evaluating the argument 'x' in selecting a method for function 'reduce': error in evaluating the argument 'x' in selecting a method for function 'expand': invalid class “Intervals” object: The 'Intervals' classes are based on two-column, numeric matrices.` I converted the dates to `as.numeric`. – Svenja Aug 14 '20 at 20:37
  • See also [identify consecutively overlapping segments in R](https://stackoverflow.com/questions/52091907/identify-consecutively-overlapping-segments-in-r) – Henrik Aug 24 '20 at 08:28

1 Answers1

1

Not clear what happens if an id has 3 overlapping rows with variable1 = c('a', NA, 'b'), what should the variable1 be for the NA for this case? a or b?

If we just choose the first variable1 when there are multiple matches, here is an option to first fill the NA and then borrow the idea from David Aurenburg's solution here

setorder(df, id, start_date, end_date)
df[, end_d := end_date + 30L]

df[is.na(variable1), variable1 :=
    df[!is.na(variable1)][.SD, on=.(id, start_date<=start_date, end_d>=start_date), mult="first", x.variable1]]

df[, g:= c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), id][,
    .(start_date=min(start_date), end_date=max(end_date)), .(id, variable1, g)]

output:

   id variable1 g start_date   end_date
1:  1         a 0 2019-05-08 2019-09-08
2:  2         d 0 2017-05-24 2017-11-24
3:  2         c 1 2019-07-12 2019-12-01
4:  3         a 0 2016-05-08 2018-08-01
5:  4         a 0 2017-02-24 2017-08-24
6:  4         b 0 2017-08-24 2018-08-24
7:  4         a 1 2019-06-12 2019-12-12

data:

library(data.table)
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
    start_date=as.IDate(c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24")),
    end_date=as.IDate(c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24")),
    variable1=c("a","c","c","d","a",NA,"a","a","b"))
chinsoon12
  • 25,005
  • 4
  • 25
  • 35