0

I have a data table of three columns id, dtstart, dtend. For example:

id  start       end 
1  01/01/2015   31/01/2015
1  02/02/2015   28/02/2015
1  01/07/2016   31/07/2016
1  01/08/2016   31/08/2016
2  01/03/2015   31/03/2015
2  01/04/2015   30/04/2015
2  01/02/2016   28/02/2016
2  01/03/2016   31/03/2016
...

I need to create another data table grouped by id with the same columns but the new start date is the minimum date in the original start date and the new end date is the maximum date in the original dtend.

When there is a break of more then one day between an end date and the next start date then it should be grouped separately.

For example for the above the new table would be:

id    start       end
1     01/01/2015  28/02/2015
1     01/07/2016  31/08/2016
2     01/03/2015  30/04/2016
2     01/02/2016  31/03/2016
...

Do I need a for loop or is there a more efficient way (data table grouping for example)? The table is over 20 million rows with 100k+ unique ids.

Cheers Andrew

Andrew Scotchmer
  • 313
  • 3
  • 12
  • I think this may get you going: [Collapse rows with overlapping ranges](https://stackoverflow.com/questions/41747742/collapse-rows-with-overlapping-ranges) – Henrik Jun 12 '17 at 19:52

2 Answers2

0

This can be done using dplyr

dt.new <- dt %>%
          arrange(id, start, end) %>%
          mutate(gr = cumsum(lag(id, default = min(id)) != id | 
                      as.numeric(difftime(start, lag(end, default = first(start)), units = 'days')) > 1)) %>%
          group_by(id, gr) %>%
          summarise(start = first(start),
                    end   = last(end))

The result is:

Source: local data frame [6 x 4]
Groups: id [?]

     id    gr      start        end
  <int> <int>     <dttm>     <dttm>
1     1     0 2015-01-01 2015-01-31
2     1     1 2015-02-02 2015-02-28
3     1     2 2016-07-01 2016-08-31
4     2     3 2015-03-01 2015-04-30
5     2     4 2016-02-01 2016-02-28
6     2     5 2016-03-01 2016-03-31    

This works and doesn't match your output because you requested a one day margin (if you want two day margins then switch from >1 to >2), and 2016 was a leap year, which is in R's internal calendar. So the margin between 2/28/2016 and 3/1/2016 is 2 days.

akash87
  • 3,876
  • 3
  • 14
  • 30
  • Thanks. The cumsum is just producing NAs for me. I've tried as.numeric(as.Date(start, format = "%Y-%m-%d")-lag(as.Date(end, format = "%Y-%m-%d"))>1) but the first grp entry is NA and so is included when grouping and summarising. – Andrew Scotchmer Jun 12 '17 at 21:30
  • You must add a `default` parameter in `lag()`. I updated my code for it. – akash87 Jun 12 '17 at 21:55
  • Thank you @akash87. Just one error where it separates into two when it should be one. For example if row 3 was: 1 05/02/2007 05/02/2007. This row is split out on its own. – Andrew Scotchmer Jun 13 '17 at 11:35
  • I am not getting that error, could you send an example? – akash87 Jun 13 '17 at 12:24
  • Will do @akash87, I put it in a reply to the original question with tabled examples. This is having me scratch my head. – Andrew Scotchmer Jun 13 '17 at 13:07
0

Thanks again @akash87

For example row 6 below is within a month so it should still return one row for id 1 from 1/02/2006 to 30/09/2006 but it breaks into two, the first from 01/02/2006 to 12/06/2006 and then from 01/07/2006 to 30/09/2016

id dtstart     dtend
1  01/02/2006  28/02/2006
1  01/03/2006  31/03/2006
1  01/04/2006  30/04/2006
1  01/05/2006  31/05/2006
1  01/06/2006  30/06/2006
1  10/06/2006  12/06/2006
1  01/07/2006  31/07/2006
1  01/08/2006  31/08/2006
1  01/09/2006  30/09/2006
2  01/04/2006  30/04/2006
2  01/05/2006  31/05/2006
2  01/09/2006  30/09/2006
2  01/10/2006  31/10/2006

So instead of returning

id start       end
1  01/02/2006  30/09/2006
2  01/04/2006  31/05/2006
2  01/09/2006  31/10/2006

We have

id start       end
1  01/02/2006  12/06/2006
1  01/07/2006  30/09/2006
2  01/04/2006  31/05/2006
2  01/09/2006  31/10/2006

Andrew

Andrew Scotchmer
  • 313
  • 3
  • 12
  • AHHH this poses a different problem: You are looking for date ranges within other date ranges in the same column. – akash87 Jun 13 '17 at 13:25
  • A not vert elegant why I suppose is to run it once more over the new table replacing summarise(... end = last(end)) with summarise(... end = max(end))??? – Andrew Scotchmer Jun 13 '17 at 20:00
  • The date range condition I've got with cumsum(lag(id, default = min(id)) != id | !(dtstart %in% min(dtstart):max(dtend))) but then adding the difftime clause overwrites it. – Andrew Scotchmer Jun 14 '17 at 10:06