1

I have a data frame which looks like the following:

id <- c("Joe" ,"Joe" ,"Joe" ,"Joe" ,"Joe")
work_start <- as.Date(c("2004-06-23", "2005-04-20", "2005-05-24", "2014-05-01", "2018-04-01"))
work_end <- as.Date(c("2014-04-30", "2010-03-11", "2005-07-05", "2018-03-31", "2019-03-31"))

df <- data.frame(id, work_start, work_end)

I want to calculate total days for Joe that he has been in work also factoring in consecutive days - so in the example above that would be a continuous period from 23 Jun 2004 to 31 Mar 2019 (as the gap between 30 Apr 2014 and 1 May 2014 is a consecutive day).

I am trying to do this by calculating the last available day and have part of a process below, but not sure how I write it into a loop within dplyr, or if this is even the right approach to take with this problem at all? Any help much appreciated.

library(dplyr)


df <- df %>%
  group_by(id) %>%
  arrange(id, work_start, work_end) %>%
  mutate(last_work_end = lag(work_end)) %>%
  mutate(last_work_end = if_else(lag(last_work_end) > last_work_end & is.finite(lag(last_work_end)),
                              lag(last_work_end),
                              last_work_end)) %>%
  mutate(last_work_end = if_else(lag(last_work_end) > last_work_end & is.finite(lag(last_work_end)),
                              lag(last_work_end),
                              last_work_end)) %>%
  ungroup()
TimL
  • 211
  • 2
  • 11
  • Isn't it enough to just calculate `work_end` – `work_start`, and then sum the differences for the total days worked? –  Mar 23 '20 at 13:05

2 Answers2

1

You could try:

library(dplyr)

df <- df %>%
  arrange(id, work_start, work_end) %>%
  group_by(id) %>%
  mutate(cumMaxDate = setattr(cummax(unclass(work_end)), "class", "Date")) %>%
  group_by(id, idx = cumsum(+(work_start > (lag(cumMaxDate, default = first(cumMaxDate)) + 1)))) %>%
  summarise(work_start = min(work_start), work_end = max(cumMaxDate), duration = difftime(work_end, work_start)) %>%
  ungroup() %>% select(-idx)

Output:

# A tibble: 1 x 4
  id    work_start work_end   duration 
  <fct> <date>     <date>     <drtn>   
1 Joe   2004-06-23 2019-03-31 5394 days

Note that if Joe had two non-consecutive periods, you would then need to group_by(id) again after the last ungroup and just do mutate(duration = sum(duration) or similar.

On the other hand, if you work a lot with this type of data (e.g. as produced in CRM or HCM systems), you can have a look at my package neatRanges. The above problem could then be solved as:

# install.packages('neatRanges')

library(dplyr) # Just for the purpose of using the pipes and `mutate`

df %>%
  neatRanges::collapse_ranges(., groups = 'id', start_var = 'work_start', end_var = 'work_end') %>%
  mutate(duration = difftime(work_end, work_start))

Output:

   id work_start   work_end  duration
1 Joe 2004-06-23 2019-03-31 5394 days

Note that the package is still in its early days, however at least the collapse_ranges function has been battle-tested to quite some extent - on the other hand, if you have any suggestions for improvement or find any bugs you're more than welcome to report on GitHub.

arg0naut91
  • 14,574
  • 2
  • 17
  • 38
  • 1
    Yep @arg0naut91 this works brilliantly for what I need it to do and makes my flapping around in dplyr look pretty sorry. I do frequently use datasets with lots of dates (and lots of NA dates!) so I will definitely check out neatRanges in future! Thanks again – TimL Mar 24 '20 at 12:14
1

Here is an option using data.table

library(data.table)
setDT(df)[order(id, work_start, work_end), 
    g := cumsum(work_start - 1L > shift(cummax(as.integer(work_end)), fill=0L)), id][,
        c("first_work_start","last_work_end") := .(min(work_start), max(work_end)), .(id, g)]

output:

    id work_start   work_end g first_work_start last_work_end
1: Joe 2004-06-23 2014-04-30 1       2004-06-23    2019-03-31
2: Joe 2005-04-20 2010-03-11 1       2004-06-23    2019-03-31
3: Joe 2005-05-24 2005-07-05 1       2004-06-23    2019-03-31
4: Joe 2014-05-01 2018-03-31 1       2004-06-23    2019-03-31
5: Joe 2018-04-01 2019-03-31 1       2004-06-23    2019-03-31

Reference: How to flatten / merge overlapping time periods

chinsoon12
  • 25,005
  • 4
  • 25
  • 35