1

This question is similar to a question here r collapse by year by ID

However I like to collapse timelines by ID and State , only if they the gap between their timelines is 31 days apart. If the gap is more than 31 days then they are not collapsed, they start with a new row. For example if this is my dataset

ID     From           To           State
1      2004-04-05     2005-02-05   MD
1      2005-03-05     2005-03-05   MD
1      2005-04-05     2005-10-05   DC
1      2006-03-05     2006-10-05   DC
1      2006-11-05     2007-03-05   DC
1      2007-04-05     2007-06-05   MD
1      2008-03-05     2008-11-05   MD
1      2008-12-05     2010-08-05   MD
1      2010-09-05     2012-11-05   MD
2      2003-05-05     2004-08-05   OR
2      2004-09-05     2009-03-05   OR
2      2010-06-05     2010-08-05   AZ
2      2013-06-05     2015-06-05   AZ

The final dataset after collapsing will looks like this

ID     From           To           State

1      2004-04-05     2005-03-05   MD

1      2005-04-05     2005-10-05   DC

1      2006-04-05     2007-03-05   DC

1      2007-04-05     2007-06-05   MD

1      2008-03-05     2012-11-05   MD

2      2003-05-05     2009-03-05   OR

2      2010-06-05     2010-08-05   AZ

2      2013-06-05     2015-06-05   AZ

Any suggestions regarding this is much appreciated.

TestCase 2 :

ID     From           To           State
1      2003-09-05     2003-11-05   MD
1      2004-09-05     2007-05-05   TX
1      2007-06-05     2007-07-05   DC
1      2007-08-05     2009-07-05   DC
1      2011-11-05     2014-03-05   MD
1      2014-05-05     2017-06-05   MD

Expected results

ID     From           To           State
1      2003-09-05     2003-11-05   MD
1      2004-09-05     2007-05-05   TX
1      2007-06-05     2009-07-05   DC 
1      2011-11-05     2017-06-05   MD
dan1st
  • 12,568
  • 8
  • 34
  • 67
Science11
  • 788
  • 1
  • 8
  • 24
  • But then why do you collapse `2003-05-05 2004-08-05 OR` and `2004-09-05 2009-03-05 OR`? There are 31 days inbetween 2004-8-5 and 2004-9-5. – ekoam Jan 18 '22 at 02:28
  • @ekoam, excellent point, I have updated my question to reflect 31 days instead of 30 days. Thanks for catching this. – Science11 Jan 18 '22 at 02:40

1 Answers1

2

Subtract current From date from the previous To date and create a new grouping column and select first From value and last To value within each group.

library(dplyr)

df %>%
  mutate(across(c(From, To), as.Date)) %>%
  group_by(ID, State, 
           group = cumsum(From - dplyr::lag(To, default = as.Date('1970-01-01')) > 31)) %>%
  summarise(From = first(From), 
            To = last(To), .groups = 'drop') %>%
  select(-group) %>%
  arrange(ID, From)

#     ID State From       To        
#  <int> <chr> <date>     <date>    
#1     1 MD    2004-04-05 2005-03-05
#2     1 DC    2005-04-05 2005-10-05
#3     1 DC    2006-03-05 2007-03-05
#4     1 MD    2007-04-05 2007-06-05
#5     1 MD    2008-03-05 2012-11-05
#6     2 OR    2003-05-05 2009-03-05
#7     2 AZ    2010-06-05 2010-08-05
#8     2 AZ    2013-06-05 2015-06-05
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • the proposed solution works for Testcase1, it does not work if the states have a gap more than 31 days and repeat again, it does not work for Testcase2. – Science11 Jan 18 '22 at 15:27