1

I have a field of dates that mark the end of weeks. I am trying to create a new field using a function that flags (1 or 0) whether that week included any of 6 holidays specified using the timeDate package. I am getting the following error: "Error during wrapup: comparison (3) is possible only for atomic and list types" - how can I resolve?

The function should take the end-of-week date (x) in the format yyyy-mm-dd (e.g., 2017-01-01) and the year of that date (y) in the format yyyy (e.g., 2017).

library(lubridate)
library(timeDate)

Date = as.Date(c("2017-01-01", "2017-01-08", "2017-01-15", "2017-01-22", "2017-06-04", "2017-07-09", "2017-07-16"))
Year = year(Date)
Holiday.During.Week = as.Date(c("2017-01-01", NA, NA, NA, "2017-05-29", "2017-07-04", NA))
Desired.Output = c(1,0,0,0,1,1,0)
data <- data.frame(Date, Year, Holiday.During.Week, Desired.Output)
data

holiday.function = function(x, y) {
  return(
    as.numeric(
      (USNewYearsDay(y) < x & USNewYearsDay(y) > (x - 7)) +
      (USMemorialDay(y) < x & USMemorialDay(y) > (x - 7)) +
      (USIndependenceDay(y) < x & USIndependenceDay(y) > (x - 7)) +
      (USLaborDay(y) < x & USLaborDay(y) > (x - 7)) +
      (USThanksgivingDay(y) < x & USThanksgivingDay(y) > (x - 7)) +
      (USChristmasDay(y) < x & USChristmasDay(y) > (x - 7))
    )
  )
}
data$Holiday.Flag = holiday.function(data$Date, data$Year)

Edit: thanks to Ian Campbell for working on this without provided data. I've updated the code to include a sample data frame and libraries

Nick B
  • 13
  • 3
  • Per `r` tag (hover or click to see): Please provide minimal and [reproducible example(s)](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5965451) along with the desired output. Use `dput()` for data and specify all non-base packages with `library()` calls. – Parfait Apr 13 '20 at 19:17

1 Answers1

0

Sounds like you have a lot of data, so let's use data.table. I generated a random sample of 10,000,000 days at the end of this answer.

First, we'll make a data.table of all of the holidays between 1900 and 2020.

library(timeDate)
library(data.table)
library(lubridate)
HolidayTable <- rbindlist(lapply(1900:2020,function(y){data.frame(Year = y, Holiday = as.Date(c(USNewYearsDay(y),USMemorialDay(y),USIndependenceDay(y),USLaborDay(y),USThanksgivingDay(y),USChristmasDay(y))))}))

We need to make a copy of the holiday date, because data.table rolling joins merge the column you join on.

setDT(test.data)
setDT(HolidayTable)
HolidayTable[,Date := Holiday]
test.data[, Year := year(Date)]

Now we do a rolling join with roll = 6 to join on dates that are at most 6 days in the future. We can then create a the desired output with a logical comparison coerced to integer with +.

HolidayTable[test.data, on = c("Year","Date"), roll = 6][
  ,.(Index,Year,Date,Holiday,HolidayPresent = +(!is.na(Holiday)))]

#             Index Year       Date    Holiday HolidayPresent
#       1:        1 2018 2018-04-21       <NA>              0
#       2:        2 2017 2017-09-30       <NA>              0
#       3:        3 2017 2017-01-07 2017-01-01              1
#       4:        4 2017 2017-08-26       <NA>              0
#       5:        5 2018 2018-09-01       <NA>              0
#      ---                                                   
# 9999996:  9999996 2017 2017-06-24       <NA>              0
# 9999997:  9999997 2018 2018-03-17       <NA>              0
# 9999998:  9999998 2018 2018-07-07 2018-07-04              1
# 9999999:  9999999 2018 2018-01-13       <NA>              0
#10000000: 10000000 2017 2017-08-12       <NA>              0

10,000,000 rows done in just 2.5 seconds on my laptop.

system.time({HolidayTable[test.data, on = c("Year","Date"), roll = 6][,.(Index,Year,Date,Holiday,HolidayPresent = +(!is.na(Holiday)))]})
   user  system elapsed 
  2.045   0.426   2.484 

Data

library(zoo)
WeekEndingDate2017 <- zoo::as.Date(Reduce(function(x,y){x + days(7)},1:51,as.Date("2017-01-07","%Y-%m-%d"), accumulate = TRUE))
WeekEndingDate2018 <- zoo::as.Date(Reduce(function(x,y){x + days(7)},1:51,as.Date("2018-01-06","%Y-%m-%d"), accumulate = TRUE))
set.seed(1)
test.data <- data.frame(Index = 1:10000000, Date = sample(c(WeekEndingDate2017,WeekEndingDate2018),size = 10000000, replace = TRUE))
Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
  • Thanks so much for the help! The dataset I'm using has 1.7M records and 100+ columns. I tried running your code with this call `apply(data,1,function(x)holiday.function(x["Date"],x["Year"]))` However it's taking forever to run (going on 45 min). Any suggestions for performance improvement? Note: I tweaked x[1] and x[2] in your code to specify which columns in the dataset to use as inputs. – Nick B Apr 14 '20 at 00:07
  • You could use a `data.table` merge on a precomputed holiday table. What range of years does your data cover? – Ian Campbell Apr 14 '20 at 02:21
  • 2017-2020. I think I see what you're saying but want to confirm: create a table with one column for all the dates in the year for each year and another column with the holiday flags (0, 1), then merge/join that table with the main dataset using date as the key? – Nick B Apr 14 '20 at 16:59
  • I only did the dates that end weeks with holidays, less comparisons. Those that don't match get NA, which I replaced in the next line with 0. I also included a join on year in addition to date, which I suspect, but am not sure increases efficiency. – Ian Campbell Apr 14 '20 at 17:01