1

If I have a dataframe with 2 columns which are YMD HMS, how do I calculate the difference in seconds between the two excluding weekends?

col 2 - col 1 = time in seconds; need to exclude the weekend seconds

Dates1 <- as.POSIXct("2011-01-30 12:00:00") + rep(0, 10)
Dates2 <- as.POSIXct("2011-02-04") + seq(0, 9, 1)
df <- data.frame(Dates1 = Dates1, Dates2 = Dates2)

I need it to give me (388800 - 43200) = 345600; The reason why I am subtracting 43200 is because that is a Sunday weekend time from Noon until Midnight which the clock stops.

user_1771
  • 47
  • 7
  • 1
    Can you provide some sample data? That would make it much easier to test that our answers fit the bill before submitting an answer. ;) – Benjamin May 22 '19 at 18:44

2 Answers2

2

Here's a solution using lubridate and other tidyverse packages. The nice thing about lubridate is that it will handle pretty seamlessly a host of oddball issues with time, from time zones to leap years to the switch to and from daylight saving time. (If you care about those, just make sure your data has time zones.)

The concept I'm using here is that of intervals in lubridate (created using the %--% operator). An interval is literally what it sounds like: a very useful class that basically has a start datetime and an end datetime.

I generate two datasets: one for your start and end times, and another for weekend start and end times, each with its own interval column. In the weekend dataset, note that the start and end times are arbitrarily set to a year of Saturdays and Sundays. You should set those with values that make sense for you, or work out a way to set it from the data. :)

From there, we'll find the overlap between your intervals and the weekend intervals using lubridate's intersect function, so later we can count up the relevant weekend seconds and subtract them out.

But first we use crossing from tidyr to make sure we're checking every one of your intervals against every weekend in the weekends dataset. It just runs a Cartesian product of the two datasets (see this SO answer).

Finally we use int_length to count up the weekend seconds, sum up the weekend seconds for each of your intervals, count up the total seconds for each, and subtract the weekend seconds from the total seconds. And voila! We have total seconds, excluding weekends.

The other nice thing about this solution is that it's extremely flexible. I've defined weekends as 0:00 Saturday to 0:00 Monday... but you could remove Friday evenings, Monday wee hours, whatever strikes your fancy and meets your analytical requirements.

library(dplyr)
library(tidyr)
library(tibble)
library(lubridate) # makes dates and times easier!

test <- tribble(
            ~start_time,             ~end_time,
  "2019-05-22 12:35:42", "2019-05-23 12:35:42", # same week no weekends
  "2019-05-22 12:35:42", "2019-05-26 12:35:42", # ends during weekend
  "2019-05-22 12:35:42", "2019-05-28 12:35:42", # next week full weekend
  "2019-05-26 12:35:42", "2019-05-29 12:35:42", # starts during weekend
  "2019-05-22 12:35:42", "2019-06-05 12:35:42"  # two weeks two weekends
) %>% 
  mutate(
    id = row_number(),
    timespan = start_time %--% end_time
  )

weekend_beginnings <- ymd_hms("2019-05-18 00:00:00") + weeks(0:51)
weekend_endings <- ymd_hms("2019-05-20 00:00:00") + weeks(0:51)
weekends <- weekend_beginnings %--% weekend_endings

final_answer <- crossing(test, weekends) %>% 
  mutate(
    weekend_intersection = intersect(timespan, weekends),
    weekend_seconds = int_length(weekend_intersection)
  ) %>% 
  group_by(id, start_time, end_time, timespan) %>% 
  summarise(
    weekend_seconds = sum(weekend_seconds, na.rm = TRUE)
  ) %>% 
  mutate(
    total_seconds = int_length(timespan),
    weekday_seconds = total_seconds - weekend_seconds
  )

glimpse(final_answer)
Benjamin
  • 876
  • 4
  • 8
  • Are you able to show *just* packages that are required for this answer? Not everybody has all of `tidyverse` installed, whether by policy or preference. It's my opinion that answers should be explicit when listing required packages. (And `tidyverse` already imports `lubridate` anyway.) – r2evans May 22 '19 at 21:23
  • 1
    I've edited the answer. Your parenthetical is incorrect, though. Running `library(tidyverse)` does not attach `lubridate`. Running `install.packages(tidyverse)` installs it (along with a host of other packages), but `library(tidyverse)` only attaches the core packages, which (oddly I'll admit) does not include `lubridate`. See https://www.tidyverse.org/packages/. – Benjamin May 22 '19 at 21:35
  • 1
    Ahh, I see now ... I had always assumed that `tidyverse` was a meta package just for attaching all related packages ... now I see the definition and use of [`core` packages](https://github.com/tidyverse/tidyverse/blob/master/R/attach.R#L1). Thanks for the correction. – r2evans May 22 '19 at 21:41
  • 1
    adjusted this code a little bit and works flawlessly; just subtracted the weeks by a lot instead of + (0:51) to get all historical information, then converted the UTC dates to PST dates and ran the function and got the values! – user_1771 May 22 '19 at 22:16
  • @user_1771 awesome! glad I could help. – Benjamin May 22 '19 at 22:54
  • 1
    is there any way to make this more efficient? I am getting an error of "errors cannot allocate vector of 3.1 GB" when using this crossing on the full data set (works flawlessly on a sample of 100) – user_1771 Jun 06 '19 at 22:12
  • ooh, ouch @user_1771, let me think about that one. – Benjamin Jun 06 '19 at 22:23
  • thanks Ben! I tried the solution as well below but it gets throw off in displaying minutes sometimes, hours other times, etc – user_1771 Jun 06 '19 at 23:00
1

Here's a cut that works on vectors:

#' Seconds difference without weekends
#'
#' @param a, b POSIXt
#' @param weekends 'character', day of the week (see
#'   [base::strptime()] for the "%w" argument), "0" is Sunday, "6" is
#'   Saturday; defaults to `c("0","6")`: Saturday and Sunday
#' @param units 'character', legal values for [base::units()], such as
#'   "secs", "mins", "hours"
#' @return 'difftime' object
#' @md
secs_no_weekend <- function(a, b, weekends = c("0", "6"), units = "secs") {
  out <- mapply(function(a0, b0) {
    astart <- as.POSIXct(format(a0, "%Y-%m-%d 00:00:00"))
    aend <- as.POSIXct(format(a0, "%Y-%m-%d 24:00:00"))
    bstart <- as.POSIXct(format(b0, "%Y-%m-%d 00:00:00"))
    days <- seq.POSIXt(astart, bstart, by = "day")
    ndays <- length(days)
    if (ndays == 1) {
      d <- b0 - a0
      units(d) <- "secs"
    } else {
      d <- rep(60 * 60 * 24, ndays) # secs
      d[1] <- `units<-`(aend - a0, "secs")
      d[ndays] <- `units<-`(b0 - bstart, "secs")
      wkend <- format(days, "%w")
      d[ wkend %in% weekends ] <- 0
    }
    sum(pmax(0, d))
  }, a, b)
  out <- structure(out, class = "difftime", units = units)
  out
}

Testing/validation:

Perhaps this will be updated as examples come in that do not match my assumptions.

For perspective, here is this month's (June 2019) calendar, in ISO-8601 (right) and US/not-ISO (left):

week <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
# sunfirst <- ... calculated
monfirst <- tibble(dt = seq(as.Date("2019-06-01"), as.Date("2019-06-30"), by="days")) %>%
  mutate(
    dow = factor(format(dt, format = "%a"), levels = week),
    dom = as.integer(format(dt, format = "%e")),
    wom = format(dt, format = "%V") # %U for sunfirst, %V for monfirst
  ) %>%
  select(-dt) %>%
  spread(dow, dom) %>%
  select(-wom)
monfirst <- rbind(monfirst, NA)
cbind(sunfirst,   ` `="     ",        monfirst                   )
#   Sun Mon Tue Wed Thu Fri Sat       Mon Tue Wed Thu Fri Sat Sun
# 1  NA  NA  NA  NA  NA  NA   1        NA  NA  NA  NA  NA   1   2
# 2   2   3   4   5   6   7   8         3   4   5   6   7   8   9
# 3   9  10  11  12  13  14  15        10  11  12  13  14  15  16
# 4  16  17  18  19  20  21  22        17  18  19  20  21  22  23
# 5  23  24  25  26  27  28  29        24  25  26  27  28  29  30
# 6  30  NA  NA  NA  NA  NA  NA        NA  NA  NA  NA  NA  NA  NA

Some data and expectations. (I use dplyr here for simplicity/readability, the function above does not require it.)

dh <-  43200 # day-half, 60*60*12
d1 <-  86400 # day=1, 60*60*24
d4 <- 345600 # days=4, 4*d1
d5 <- 432000 # days=5
d7 <- 432000 # 7 days minus weekend
d <- tribble(
  ~x                   , ~y                   , ~expect, ~description
, "2019-06-03 12:00:00", "2019-06-03 12:00:05",      5 , "same day"
, "2019-06-03 12:00:00", "2019-06-04 12:00:05",   d1+5 , "next day"
, "2019-06-03 12:00:00", "2019-06-07 12:00:05",   d4+5 , "4d + 5"
, "2019-06-03 12:00:00", "2019-06-08 12:00:05",  d4+dh , "start weekday, end weekend, no 5"
, "2019-06-03 12:00:00", "2019-06-09 12:00:05",  d4+dh , "start weekday, end weekend+, no 5, same"
, "2019-06-03 12:00:00", "2019-06-10 12:00:05",   d7+5 , "start/end weekday, 1 full week"
, "2019-06-02 12:00:00", "2019-06-03 12:00:05",   dh+5 , "start weekend, end weekday, 1/2 day"
, "2019-06-02 12:00:00", "2019-06-08 12:00:05",     d7 , "start/end weekend, no 5"
) %>% mutate_at(vars(x, y), as.POSIXct)
(out <- secs_no_weekend(d$x, d$y))
# Time differences in secs
# [1]      5  86405 345605 388800 388800 432005  43205 432000
all(out == d$expect)
# [1] TRUE
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • sometimes this gives it in minutes, sometimes in hours - how do we always make it show in seconds? removing the if ndays = 1 throws everything off too – user_1771 Jun 06 '19 at 22:59
  • @user_1771, see my edit, I think that'll work. If this doesn't work, please edit your question with the conditions that produce different units. – r2evans Jun 07 '19 at 02:25
  • 1
    awesome! is super fast without Cartesian product, as long as Dates 2 is later than Dates 1 its perfect – user_1771 Jun 07 '19 at 16:53
  • 1
    if there are 2 weekend dates though, it will show up as negative. easy workaround is just to override these to 0. However it looks like if D1 is weekday and D2 is weekend it still computes the total time instead of weekday only, and same if D1 is weekend and D2 is weekday. any way around this? – user_1771 Jun 07 '19 at 18:36
  • Good catch, I just added a `pmax` to ensure no negatives. – r2evans Jun 07 '19 at 18:43
  • R must be defaulting date conversions or something: when giving exact date times: 2019-04-21 23:03:22 2019-04-22 03:04:25 11065 secs = what it should be ... but how its in the data: 2019-04-21 23:03:22 2019-04-22 03:04:25 0 secs both are POSIXct, if I convert the data to tz=PST it gives even stranger (21865 seconds). any idea? – user_1771 Jun 07 '19 at 19:41
  • 1
    Got it - just specific specific time zone in your function (i.e. astart <- as.POSIXct(format(a0, "%Y-%m-%d 00:00:00"),tz='America/Los_Angeles') all set! – user_1771 Jun 07 '19 at 20:09
  • 1
    Wow, this is beautiful @r2evans, I love to see the different ways to approach a problem. OP, I'll consider myself off the hook for optimizing my solution. ;) – Benjamin Jun 10 '19 at 16:19
  • yup off the hook! – user_1771 Jun 10 '19 at 20:26