I have a data.table in R, within this data.table are some date ranges (from and to date) and the data can be grouped together by an id. For each id, I would like to combine overlapping date ranges to end up with unique date ranges for each id which do not overlap
e.g. this is a very simplified version of my data
input <- data.table(
id = c(rep(1, 4), rep(2, 3), rep(3, 2)),
from_date = c('2000-01-01', '2001-01-01', '2002-01-01', '2003-01-01',
'2000-01-01', '2001-01-01', '2002-01-01',
'2000-01-01', '2001-01-01'),
to_date = c('2000-06-01', '2003-06-01', '2002-06-01', '2003-10-01',
'2001-05-01', '2001-02-01', '2002-06-01',
'2000-06-01', '2002-06-01')
)
> input
id from_date to_date
1: 1 2000-01-01 2000-06-01
2: 1 2001-01-01 2003-06-01
3: 1 2002-01-01 2002-06-01
4: 1 2003-01-01 2003-10-01
5: 2 2000-01-01 2001-05-01
6: 2 2001-01-01 2001-02-01
7: 2 2002-01-01 2002-06-01
8: 3 2000-01-01 2000-06-01
9: 3 2001-01-01 2002-06-01
For id 1, the second, third and fouth date ranges overlap, so I would need to combine them all into one with the from and to date reflecting the min/max date in the range. For id 2 the second and third date range overlap (indeed, the second date range contains the third date range entirely) so again these would need combining. For the third id they do not overlap so can remain the same
So for the above, i would want to end up with
> output
id from_date to_date
1: 1 2000-01-01 2000-06-01
2: 1 2001-01-01 2003-10-01
3: 2 2000-01-01 2001-05-01
4: 2 2002-01-01 2002-06-01
5: 3 2000-01-01 2000-06-01
6: 3 2001-01-01 2002-06-01
In reality by data.table is a lot bigger (1,000,000s rows with 100,000s groups) and contains other data along with the id and date ranges, and so it is important that any solution is reasonably performant and fast, ideally using data.table rather than dplyr for example, as the rest of the code uses data.tables.
Thanks in advance