0

I would like a function that takes three columns of a data frame with the year, month, and day and constructs a valid date object. I would like the behavior an syntax to be like the Excel DATE function. Building on the answer to another question, I have the following function:

ExcelDate <- function(yearCol, monthCol, dayCol) {
    # Syntax: DATE(year,month,day)
    # Requires lubridate and magrittr which are both in tidyverse
    ReturnCol = paste(yearCol, monthCol, dayCol, sep="-") %>% ymd() %>% as.Date()
    return(ReturnCol)
}

Note that Excel dates have an integer numbering system relative to 1900 that I'm not looking to duplicate, I just want the year / month / day input format and the output date column formated as a date object.

However, the Excel function handles invalid inputs in a way that is very useful for date arithmetic. See the following picture for examples. Basically, the 0th day of the month is the last day of the prior month and the 0th month is the last month of the prior year. Any input t-days beyond the normal month length are converted into the t-th day of the following month (or year if needed.

There are lots of fiddly edge cases here (different month lengths, leap years, inputs that shift over multiple months or years) that are making me reluctant to write such a thing from scratch. What's a vectorized way of accomplishing this that uses existing date libraries to make this easy and relatively fast?

Year (input) Month (input) Day (input) Date (output) Notes
2020 2 17 2/17/2020 Code handles this case
2020 2 30 3/1/2020 Too many days
2020 2 -1 1/30/2020 Too few days
2020 2 0 1/31/2020 Too few days
2020 13 17 1/17/2021 Too many months
2020 0 17 12/17/2019 Too few months
2020 2 67 4/7/2020 Way too many days
2020 2 670 12/1/2021 Way way too many days
2020 2 29 2/29/2020 Leap year
2021 2 29 3/1/2021 Not a leap year
2100 2 29 3/1/2100 Not a leap year (surprise!)

Here is sample code to load the data and the desired answer into R:

df =  data.frame(
  year  = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2021, 2100), 
  month = c(2, 2, 2, 2, 13, 0, 2, 2, 2, 2, 2), 
  day   = c(17, 30, -1, 0, 17, 17, 67, 670, 29, 29, 29), 
  date = mdy(c("2/17/2020", "3/1/2020", "1/30/2020", "1/31/2020", "1/17/2021", "12/17/2019", "4/7/2020", "12/1/2021", "2/29/2020", "3/1/2021", "3/1/2100"))
  )
BKay
  • 1,397
  • 1
  • 15
  • 26
  • I would suggest you construct the date using proper `year()`, `month()`, and `day()` functions from `lubridate` and not `paste0` – Matias Andina Oct 06 '21 at 19:50
  • @akrun, It doesn't handle any of the special cases. The ymd function from lubridate used as a parser only handles valid dates like the first case. – BKay Oct 06 '21 at 19:51
  • 1
    @akrun, now formatted as a table. – BKay Oct 06 '21 at 19:53
  • @BKay How can I load the table into R for testing? – dario Oct 06 '21 at 19:54
  • Basically what you wanted is a date parser with invalid days, checked with month and year, (leapyear). If it is greater or lesser, then add the days – akrun Oct 06 '21 at 20:00
  • i.e. probabably something like `library(lubridate); with(df1, ymd(Year, truncated = 2) %m+% months(Month) %m+% days(Day))` – akrun Oct 06 '21 at 20:03
  • @dario, I have added sample data. – BKay Oct 06 '21 at 20:07

1 Answers1

0

Building on @akrun's comment (and tweaking to make it work), I get:

df =  data.frame(
  year  = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2021, 2100), 
  month = c(2, 2, 2, 2, 13, 0, 2, 2, 2, 2, 2), 
  day   = c(17, 30, -1, 0, 17, 17, 67, 670, 29, 29, 29), 
  date = mdy(c("2/17/2020", "3/1/2020", "1/30/2020", "1/31/2020", "1/17/2021", "12/17/2019", "4/7/2020", "12/1/2021", "2/29/2020", "3/1/2021", "3/1/2100"))
  )
df$candidate_date = with(df, ymd(year, truncated = 2) %m+% months(month-1) %m+% days(day-1)) 
df$candidate_date == df$date

Which returns:

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Or, using a function like in my example above:

ExcelDate <- function(yearCol, monthCol, dayCol) {
  #Syntax: DATE(year,month,day)
  # Requires lubridate 
  ReturnCol = ymd(yearCol, truncated = 2) %m+% months(monthCol-1) %m+% days(dayCol-1) 
  return(ReturnCol)
}

ExcelDate(df$year, df$month, df$day) == df$date

Which also returns:

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

BKay
  • 1,397
  • 1
  • 15
  • 26