After removing the leap day from my time series, I used `strftime()` with `format = "%j"` to get the day-of-year (DoY) values. However, the last DoY value was still 366 rather than 365, because DoY = 60 (where 1996-02-29 was) is skipped rather than renumbered. How can I get the correct day of year after removing the leap day from my time series?

similar StackOverflow question here

Example:

df <- data.frame(matrix(ncol = 2, nrow = 366))
x <- c("date", "DoY")
colnames(df) <- x
start = as.Date("1996-01-01")
end = as.Date("1996-12-31")
df$date <- seq.Date(start,end,1)
remove_leap <- as.Date(c("1996-02-29"))
df <- df[!df$date %in% remove_leap,]
df$DoY <- strftime(df$date, format = "%j") #this formats the date to DoY values but still *sees* the leap day giving a max DoY = 366 rather than 365
df$DoY <- as.numeric(df$DoY)
Anoushiravan R
tassones
  • 366 *is* by definition the correct day of the year, even if you remove the leap date. If you want the day of the year, disregarding the leap year, you'll need something along the lines of `library(lubridate); DoY <- as.integer(strftime(df$date, format = "%j")); DoY <- DoY - ifelse(year(df$date) %% 4 == 0 & DoY > 60, DoY - 1, DoY)` (untested). Basically subtracting 1 from the result if it is after the specific date. – Oliver Jun 17 '21 at 20:55
  • Sure but this answer breaks down rather quickly if the time series includes an additional year without a leap day. For example, if this was a two-year dataset with the year 1997 added. – tassones Jun 17 '21 at 21:06
  • @Oliver's first sentence correctly states the problem. Your question is ill-founded. You appear to be manipulating your data to fit your model. A better approach is to adapt your model to reflect the data. – Limey Jun 17 '21 at 21:27
  • @tassones the answer does not break on non-leap year. `year(df$date) %% 4 == 0` checks for a leap year, `DoY > 60` checks whether the date within the leap year is after the leap date. – Oliver Jun 17 '21 at 21:28
  • @Oliver Maybe I'm missing something obvious but when I use `library(lubridate); df$DoY <- as.integer(strftime(df$date, format = "%j")); DoY <- DoY - ifelse(year(df$date) %% 4 == 0 & DoY > 60, DoY - 1, DoY)` the DoY sequence still skips the number 60. – tassones Jun 17 '21 at 21:38
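
The subtraction idea from these comments can be sketched in base R as below. This is only a sketch, not the code posted in the thread: the adjustment is written as "subtract 1 when the date is in a leap year and falls after calendar day 60". The variable names `yr`, `doy` and `is_leap` are illustrative, and it assumes every Feb 29 has already been dropped. It uses the full Gregorian leap-year rule rather than `%% 4` alone.

```r
# Example data: 1996 (a leap year) through early 1997, with the leap day removed
df <- data.frame(date = seq.Date(as.Date("1996-01-01"), as.Date("1997-01-03"), by = "day"))
df <- df[df$date != as.Date("1996-02-29"), , drop = FALSE]

yr  <- as.integer(format(df$date, "%Y"))
doy <- as.integer(format(df$date, "%j"))   # calendar day of year, 1..366

# Gregorian leap-year rule (illustrative helper variable, not from the thread)
is_leap <- (yr %% 4 == 0 & yr %% 100 != 0) | yr %% 400 == 0

# Shift every date after Feb 29 (calendar day 60) in a leap year back by one
df$DoY <- doy - as.integer(is_leap & doy > 60)
```

With this, `1996-03-01` becomes 60, `1996-12-31` becomes 365, and `1997-01-01` stays 1, so non-leap years are left untouched.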

1 Answer

I can take it from here and correct the DoY like this so it ends at 365:

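library(dplyr)
library(lubridate)

df %>% 
  # Start from the day of month; Month and Year are helper columns for grouping
  mutate(DoY = day(date),
         Month = month(date),
         Year = year(date)) %>% 
  group_by(Year, Month) %>%
  # Step from the previous row within the same month (first row keeps its day of month)
  mutate(DoY = DoY - lag(DoY, default = 0)) %>%
  group_by(Year) %>%
  # Running total of the steps, restarting each year
  mutate(DoY = cumsum(DoY)) %>% 
  select(-Month) %>%
  slice_tail(n = 10)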

# A tibble: 10 x 2
   date         DoY
   <date>     <dbl>
 1 1996-12-22   356
 2 1996-12-23   357
 3 1996-12-24   358
 4 1996-12-25   359
 5 1996-12-26   360
 6 1996-12-27   361
 7 1996-12-28   362
 8 1996-12-29   363
 9 1996-12-30   364
10 1996-12-31   365
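
As a cross-check that does not depend on row order or grouping, one option (not part of the pipeline above, just a base-R sketch) is to re-stamp each date onto a fixed non-leap year and read `%j` from that. The reference year 1997 and the names `dates`, `ref` and `DoY` are arbitrary choices for illustration. It assumes Feb 29 has already been removed; any leap day left in the data would come out as `NA`.

```r
# Example data: 1996 (a leap year) through early 1997, with the leap day removed
dates <- seq.Date(as.Date("1996-01-01"), as.Date("1997-01-03"), by = "day")
dates <- dates[dates != as.Date("1996-02-29")]

# Keep month and day, but swap the year for a fixed non-leap year (1997 here),
# then read the day of year from the re-stamped date
ref <- as.Date(format(dates, "1997-%m-%d"), format = "%Y-%m-%d")
DoY <- as.integer(format(ref, "%j"))
```

Because each date is mapped independently, this behaves the same across multiple years, multiple groups, or series with other missing days.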
Anoushiravan R
  • This works well but seems to break down once the time series moves into the following year. If you extend the end date in the example to `end = as.Date("1997-01-03")` and re-run the analysis, the date `1997-01-01` shows up as DoY = 335 rather than 1 – tassones Jun 17 '21 at 22:12
  • @tassones I made a slight modification: the data are now also grouped by `Year`, so the day count starts over each year. Can you check this one please? – Anoushiravan R Jun 17 '21 at 22:18
  • For anyone who might stumble onto this page that has this same question but with multiple groups, here is how that would work: `df %>% mutate(DoY = day(date),Month = month(date),Year = year(date)) %>% group_by(Year, Month, GROUP_ID) %>% mutate(DoY = DoY - lag(DoY, default = 0)) %>% group_by(Year, GROUP_ID) %>% mutate(DoY = cumsum(DoY)) %>% select(-Month)`. Just replace `GROUP_ID` with your grouping variable. – tassones Jun 18 '21 at 01:22