Count bizdays that intersect between lubridate intervals in R

Question

I have a dataset with ~ 330 000 rows. Each observation represents a period where an individual recieved a welfare benefit called "care allowance". The benefit is meant to replace income when the recipient has to be absent from work in order to care for their child full-time due to serious illness or to accompany them to a specialist healthcare institution.

There was a change in legislation regarding the welfare benefit in 2017, and one of my research questions concerns changes in the size and composition of the recipient population. My dataset contains information regarding regarding each case of benefit reception from Jan 1st 2016 to Dec 31 2021.

I want to portray the development in the amount of work days that have been compensated by the care allowance scheme over time. In many cases a period of care allowance reception can span years. I want to count the number of business days (e.g monday through friday) in the interval from the start date and end date of the reception period that falls within each of the years from 2016 to 2021.

I am only able to get the count of ordinary days for each year. I would be very appreciative of suggestions on how to modify my code so that df$bdays == df$days and the vars(days16:days21) count the number of business days instead.

Update

@Marcus' suggestion works well enough on a small dataset, but takes an unwieldy amount of time to execute on my larger dataset (over an hour and a half). I've come up with a solution using purrr::map2_dbl()

Original code:

library(bizdays)
library(lubridate)
library(dplyr)


id <- sort(sample(1:100, 1000, replace = T))
start_date <- sample(seq(ymd("2016-01-01"), ymd("2021-12-30"), by="day"), 1000)
end_date <- sample(seq(ymd("2016-01-01"), ymd("2021-12-31"), by="day"), 1000)

df <- data.frame(id, start_date, end_date) %>%
  filter(end_date > start_date) %>%
  mutate(interval = interval(start = start_date, end = end_date))



df <- df %>%
  mutate(days16 = as.period(intersect(interval, interval(ymd("2016-01-01"), ymd("2016-12-31"))))%/%days(1),
         days17 = as.period(intersect(interval, interval(ymd("2017-01-01"), ymd("2017-12-31"))))%/%days(1),
         days18 = as.period(intersect(interval, interval(ymd("2018-01-01"), ymd("2018-12-31"))))%/%days(1),
         days19 = as.period(intersect(interval, interval(ymd("2019-01-01"), ymd("2019-12-31"))))%/%days(1),
         days20 = as.period(intersect(interval, interval(ymd("2020-01-01"), ymd("2020-12-31"))))%/%days(1),
         days21 = as.period(intersect(interval, interval(ymd("2021-01-01"), ymd("2021-12-31"))))%/%days(1))

df[is.na(df)] <- 0 

cal <- create.calendar(name = "mycal", weekdays=c("saturday", "sunday"))


df <- df %>%
  mutate(days = days16 + days17 + days18 + days19 + days20 + days21) %>%
  mutate(bdays = bizdays(start_date, end_date, cal)) %>%
  arrange(id, start_date)

head(df, n = 10)
#>    id start_date   end_date                       interval days16 days17 days18
#> 1   1 2016-03-15 2017-04-20 2016-03-15 UTC--2017-04-20 UTC    289    110      0
#> 2   1 2016-07-10 2018-12-14 2016-07-10 UTC--2018-12-14 UTC    173    364    347
#> 3   1 2018-03-06 2021-01-11 2018-03-06 UTC--2021-01-11 UTC      0      0    298
#> 4   1 2018-09-01 2019-04-21 2018-09-01 UTC--2019-04-21 UTC      0      0    121
#> 5   2 2016-04-27 2019-04-28 2016-04-27 UTC--2019-04-28 UTC    247    364    364
#> 6   2 2016-08-13 2019-09-10 2016-08-13 UTC--2019-09-10 UTC    139    364    364
#> 7   2 2016-10-03 2017-10-05 2016-10-03 UTC--2017-10-05 UTC     88    277      0
#> 8   2 2018-05-12 2018-07-17 2018-05-12 UTC--2018-07-17 UTC      0      0     65
#> 9   2 2019-08-29 2021-10-11 2019-08-29 UTC--2021-10-11 UTC      0      0      0
#> 10  2 2019-10-08 2020-08-05 2019-10-08 UTC--2020-08-05 UTC      0      0      0
#>    days19 days20 days21 days bdays
#> 1       0      0      0  399   287
#> 2       0      0      0  884   634
#> 3     364    364     10 1036   744
#> 4     111      0      0  232   164
#> 5     118      0      0 1093   782
#> 6     252      0      0 1119   801
#> 7       0      0      0  365   263
#> 8       0      0      0   65    46
#> 9     123    364    283  770   552
#> 10     83    217      0  300   216

^{Created on 2022-09-30 by the reprex package (v2.0.1)}

score 0 · Accepted Answer · answered Sep 30 '22 at 18:27

I would apply the bizdays function rowwise to each entry (warning this might take a while to run). This allows you to take either the start/end date or the start/end of the year to define the arguments to bizdays. Also move up your definition of calendar, setting it to financial = FALSE. Otherwise, if the last day of the year falls on a workday (when tallying the years), it won't get counted.

cal <- create.calendar(name = "mycal", weekdays=c("saturday", "sunday"), financial = FALSE)

df <- df %>%
  rowwise() |>
  mutate(
    days16 = bizdays(max(start_date, ymd("2016-01-01")), min(end_date, ymd("2016-12-31")), cal),
    days17 = bizdays(max(start_date, ymd("2017-01-01")), min(end_date, ymd("2017-12-31")), cal),
    days18 = bizdays(max(start_date, ymd("2018-01-01")), min(end_date, ymd("2018-12-31")), cal),
    days19 = bizdays(max(start_date, ymd("2019-01-01")), min(end_date, ymd("2019-12-31")), cal),
    days20 = bizdays(max(start_date, ymd("2020-01-01")), min(end_date, ymd("2020-12-31")), cal),
    days21 = bizdays(max(start_date, ymd("2021-01-01")), min(end_date, ymd("2021-12-31")), cal)
  )

df[is.na(df) | df < 0] <- 0 

df <- df %>%
  mutate(days = days16 + days17 + days18 + days19 + days20 + days21) %>%
  mutate(bdays = bizdays(start_date, end_date, cal)) %>%
  arrange(id, start_date)

df |> 
  as.data.frame() |> 
  head(n = 10)
#>    id start_date   end_date                       interval days16 days17 days18
#> 1   1 2017-02-06 2017-04-03 2017-02-06 UTC--2017-04-03 UTC      0     41      0
#> 2   1 2017-07-18 2018-05-27 2017-07-18 UTC--2018-05-27 UTC      0    119    105
#> 3   1 2019-02-06 2019-12-26 2019-02-06 UTC--2019-12-26 UTC      0      0      0
#> 4   1 2019-04-29 2020-02-15 2019-04-29 UTC--2020-02-15 UTC      0      0      0
#> 5   2 2016-01-07 2018-08-05 2016-01-07 UTC--2018-08-05 UTC    257    260    155
#> 6   2 2016-02-22 2016-11-17 2016-02-22 UTC--2016-11-17 UTC    194      0      0
#> 7   2 2016-12-04 2021-05-19 2016-12-04 UTC--2021-05-19 UTC     20    260    261
#> 8   2 2018-08-28 2020-09-26 2018-08-28 UTC--2020-09-26 UTC      0      0     90
#> 9   3 2016-10-21 2017-10-24 2016-10-21 UTC--2017-10-24 UTC     51    212      0
#> 10  3 2017-02-08 2021-07-04 2017-02-08 UTC--2021-07-04 UTC      0    233    261
#>    days19 days20 days21 days bdays
#> 1       0      0      0   41    41
#> 2       0      0      0  224   224
#> 3     232      0      0  232   232
#> 4     177     33      0  210   210
#> 5       0      0      0  672   672
#> 6       0      0      0  194   194
#> 7     261    262     99 1163  1163
#> 8     261    193      0  544   544
#> 9       0      0      0  263   263
#> 10    261    262    131 1148  1148

all(df3$days == df3$bdays)
#> [1] TRUE

Thank you Marcus for this nifty solution to my problem! I also appreciate the additional input regarding the calender argument. However (as you warned about) it sure takes a while to run (15 minutes now and counting), so I haven't actually verified that it does the job yet. — mgronas_werring, Oct 03 '22 at 06:48

mgronas_werring · Answer 2 · 2022-10-05T07:56:35.863

I managed to solve it using purrr::map2_dbl(). The function allows one to specify the expected output and the processing time goes way down to less than a minute:

library(bizdays)
library(lubridate)
library(dplyr)
library(purrr)

id <- sort(sample(1:100, 1000, replace = T))
start_date <- sample(seq(ymd("2016-01-01"), ymd("2021-12-30"), by="day"), 1000)
end_date <- sample(seq(ymd("2016-01-01"), ymd("2021-12-31"), by="day"), 1000)

cal <- create.calendar(name = "mycal", weekdays=c("saturday", "sunday"), financial = FALSE)

df <- data.frame(id, start_date, end_date) %>%
  filter(end_date > start_date) %>%
  mutate(interval = interval(start = start_date, end = end_date))


df <- df %>%
  mutate(days16 = bizdays(as_date(map2_dbl(df$start_date, rep(ymd("2016-01-01"), nrow(df)), max)),
                          as_date(map2_dbl(df$end_date, rep(ymd("2016-12-31"), nrow(df)), min)), cal),
         days17 = bizdays(as_date(map2_dbl(df$start_date, rep(ymd("2017-01-01"), nrow(df)), max)),
                          as_date(map2_dbl(df$end_date, rep(ymd("2017-12-31"), nrow(df)), min)), cal),
         days18 = bizdays(as_date(map2_dbl(df$start_date, rep(ymd("2018-01-01"), nrow(df)), max)),
                          as_date(map2_dbl(df$end_date, rep(ymd("2018-12-31"), nrow(df)), min)), cal),
         days19 = bizdays(as_date(map2_dbl(df$start_date, rep(ymd("2019-01-01"), nrow(df)), max)),
                          as_date(map2_dbl(df$end_date, rep(ymd("2019-12-31"), nrow(df)), min)), cal),
         days20 = bizdays(as_date(map2_dbl(df$start_date, rep(ymd("2020-01-01"), nrow(df)), max)),
                          as_date(map2_dbl(df$end_date, rep(ymd("2020-12-31"), nrow(df)), min)), cal),
         days21 = bizdays(as_date(map2_dbl(df$start_date, rep(ymd("2021-01-01"), nrow(df)), max)),
                          as_date(map2_dbl(df$end_date, rep(ymd("2021-12-31"), nrow(df)), min)), cal))

df[is.na(df) | df < 0] <- 0 


df <- df %>%
  mutate(days = days16 + days17 + days18 + days19 + days20 + days21) %>%
  mutate(bdays = bizdays(start_date, end_date, cal)) %>%
  arrange(id, start_date)

head(df, 10)
#>    id start_date   end_date                       interval days16 days17 days18
#> 1   1 2016-02-26 2016-04-12 2016-02-26 UTC--2016-04-12 UTC     33      0      0
#> 2   1 2016-03-12 2017-04-12 2016-03-12 UTC--2017-04-12 UTC    210     73      0
#> 3   1 2016-10-11 2018-08-12 2016-10-11 UTC--2018-08-12 UTC     59    260    160
#> 4   1 2017-03-23 2021-08-14 2017-03-23 UTC--2021-08-14 UTC      0    202    261
#> 5   1 2017-04-21 2021-08-05 2017-04-21 UTC--2021-08-05 UTC      0    181    261
#> 6   1 2017-05-21 2018-07-17 2017-05-21 UTC--2018-07-17 UTC      0    160    142
#> 7   1 2017-08-07 2019-10-03 2017-08-07 UTC--2019-10-03 UTC      0    105    261
#> 8   1 2018-02-28 2019-07-30 2018-02-28 UTC--2019-07-30 UTC      0      0    219
#> 9   1 2019-02-02 2019-09-17 2019-02-02 UTC--2019-09-17 UTC      0      0      0
#> 10  1 2020-02-16 2021-12-03 2020-02-16 UTC--2021-12-03 UTC      0      0      0
#>    days19 days20 days21 days bdays
#> 1       0      0      0   33    33
#> 2       0      0      0  283   283
#> 3       0      0      0  479   479
#> 4     261    262    161 1147  1147
#> 5     261    262    155 1120  1120
#> 6       0      0      0  302   302
#> 7     198      0      0  564   564
#> 8     151      0      0  370   370
#> 9     162      0      0  162   162
#> 10      0    229    241  470   470

all(df$days == df$bdays)
#> [1] TRUE

^{Created on 2022-10-05 by the reprex package (v2.0.1)}

Count bizdays that intersect between lubridate intervals in R

2 Answers2