I have a data set with IDs, begin date, end date, activity status. The current data set counts 150.000 rows and 50.000 IDs but will be much bigger.
I would like with R regrouping on the same row for the same ID and the same periods (begin date and end dates), the different activities occurring on an overlapping period.
In other words, I try to: 1) reshaping the data to get overlap and non-overlap periods in specific rows. 2) pasting together overlapping activity status.
I already tried:
Find overlapping dates for each ID and create a new row for the overlap
and
Convert data of overlapping time ranges to data of distinct time ranges
Both of these solutions are very good and helped me a lot.
However these two approaches cannot handle in a one piece all the data set of 150.000 rows. I used the second solution by dividing my data set. I needed 2 hours to analyse 18 000 rows.
I wonder if some solution could be more efficient on R. I was thinking using mutate but I'm not very familiar with that function.
The input data looks like this:
ID START END STATUS
5c0e83 2013-11-01 2015-01-01 P1
5c0e83 2002-09-01 2003-09-01 F2
5c0e83 2016-09-01 2016-09-01 F0
5c0e83 2006-01-01 2007-01-01 P3
5c0e83 2003-11-01 2013-11-01 P2
5c0e83 2015-09-01 2018-01-01 P0
5c0e83 2003-09-01 2005-09-01 F1
Thanks the script previously cited the output have been like this (Convert data of overlapping time ranges to data of distinct time ranges)
ID STATUS START END
5c0e83 F3 2002-09-01 2002-09-01
5c0e83 F3 2002-09-01 2002-10-01
5c0e83 F3 2002-10-01 2002-10-01
5c0e83 F3 2002-10-01 2002-11-01
5c0e83 F3 2002-11-01 2002-11-01
5c0e83 F3 2002-11-01 2003-01-01
5c0e83 F3 2003-01-01 2003-01-01
5c0e83 F3 2003-01-01 2003-09-01
5c0e83 F3, F2 2003-09-01 2003-09-01
5c0e83 F2 2003-09-01 2003-10-01
5c0e83 F2 2003-10-01 2003-10-01
5c0e83 F2 2003-10-01 2003-11-01
5c0e83 F2, P2 2003-11-01 2003-11-01
5c0e83 F2, P2 2003-11-01 2004-01-01
5c0e83 F2, P2 2004-01-01 2004-01-01
5c0e83 F2, P2 2004-01-01 2004-09-01
5c0e83 F2, P2 2004-09-01 2004-09-01
5c0e83 F2, P2 2004-09-01 2004-10-01
5c0e83 F2, P2 2004-10-01 2004-10-01
5c0e83 F2, P2 2004-10-01 2004-11-01
The result is good but the time for treatment is very long. I wonder if someone thinks about a more efficient way to perform this task.
Thanks for your help