I'm having a similar issue to the one in the question "lubridate adds a century", and I'm not having much success with the answer provided there. The function from that answer returns a list, and when I try to put the result into the data frame I get negative values and lose the date format.

What I'm trying to do is to 1) reverse any inappropriate adding of centuries, 2) return that value as a vector back in the dataframe, and 3) understand exactly what I used to do it.

I am using lubridate 1.9.2, and R version 4.2.3. I have the following:

some_dates = c("1/1/63", "1/1/94", "1/1/65", "1/1/01", "1/1/86", "1/1/61", "1/1/71", "1/1/69", "1/1/86", "1/1/83", "1/1/94", "1/1/57", "1/1/79", "1/1/83", "1/1/01", "1/1/55", "1/1/77", "1/1/77", "1/1/77", "1/1/90")

twenty_later = c("1/1/84", "1/1/04", "1/1/85", "1/1/21", "1/1/06", "1/1/81", "1/1/91", "1/1/89", "1/1/06", "1/1/03", "1/1/14", "1/1/77", "1/1/99", "1/1/03", "1/1/21", "1/1/75", "1/1/97", "1/1/97", "1/1/97", "1/1/10")

library(dplyr)     # mutate()
library(lubridate) # dmy(), year()

df <- data.frame(some_dates, twenty_later)

df <- df |>
  mutate(
    some_dates_clean = dmy(some_dates),
    twenty_later_clean = dmy(twenty_later)
  )

This parses 1/1/57 as 2057 instead of 1957. The earlier question offered this function:

adjustCentury <- function(d, threshold=1930){
  y <- year(d) %% 100
  if(y > threshold %% 100) year(d) <- 1900 + y
  d
}

but when I apply it

df$some_dates_clean2 <- lapply(df$some_dates_clean, adjustCentury)

it gives the wrong output. If I call it on its own:

lapply(df$some_dates_clean, adjustCentury)

it creates a list. I can unlist() that, but the result comes back in an unwanted format.
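For context on the list and the negative values: lapply() always returns a list, and unlist() strips the Date class, leaving the underlying count of days since 1970-01-01 (negative for dates before 1970). Below is a minimal sketch of one way around this, assuming a vectorized rewrite of the function is acceptable. The name adjust_century_vec is hypothetical and does not come from the earlier question.

```r
library(lubridate)

# Dates as dmy() returns them for "1/1/57", "1/1/01", "1/1/94"
d <- as.Date(c("2057-01-01", "2001-01-01", "1994-01-01"))

# unlist() drops the Date class and leaves plain day counts
raw_days <- unlist(lapply(d, identity))
class(raw_days)  # "numeric"

# Hypothetical vectorized variant of adjustCentury(): it works on the
# whole column at once, so no lapply()/unlist() round trip is needed
adjust_century_vec <- function(d, threshold = 1930) {
  y <- year(d) %% 100
  fix <- !is.na(y) & y > threshold %% 100
  year(d[fix]) <- 1900 + y[fix]
  d
}

fixed <- adjust_century_vec(d)
fixed  # still a Date vector: 1957-01-01, 2001-01-01, 1994-01-01
```

With a function like this, df$some_dates_clean2 <- adjust_century_vec(df$some_dates_clean) stores a regular Date column. If the original one-element function has to stay, do.call(c, lapply(...)) should keep the Date class where unlist() does not.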

Another suggested approach, str_replace(some_dates, '[0-9]+$', '19\\0'), takes values that should be 2001 and turns them into 1901. I'm not great at regex: I understand that the pattern is looking for digits 0-9, but I'm not sure how the replacement argument 19\\0 is interpreted, even after trying a live demo on regex101.com.
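On the regex itself: '[0-9]+$' matches the run of digits at the end of the string (the two-digit year), and in the replacement \\0 stands for the whole matched text, so '19\\0' means "the literal 19 followed by whatever was matched". That is why every year, including 01, gets a 19 prefix. The sketch below uses base R sub() with an explicit capture group (\\1) to stay dependency-free. The cutoff of 30 is an assumption borrowed from the threshold in adjustCentury above.

```r
x <- c("1/1/57", "1/1/01", "1/1/94")

# Unconditional prefix: "01" becomes "1901", which is the problem
always_19 <- sub("([0-9]+)$", "19\\1", x)
always_19  # "1/1/1957" "1/1/1901" "1/1/1994"

# Conditional prefix: pick the century from the two-digit year
yy <- as.integer(sub(".*/", "", x))        # 57, 1, 94
century <- ifelse(yy <= 30, "20", "19")    # assumed cutoff of 30
full <- paste0(sub("[0-9]+$", "", x), century, sprintf("%02d", yy))
full  # "1/1/1957" "1/1/2001" "1/1/1994"

# Four-digit years can then be parsed without any century guessing
parsed <- as.Date(full, format = "%d/%m/%Y")
```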

This approach works, but it's not at all clear to me why, according to its poster, it fails in 2057:

# `dates` is the already-parsed Date vector, e.g. dmy(some_dates)
future_dates <- year(dates) > year(Sys.Date())
year(dates[future_dates]) <- year(dates[future_dates]) - 100
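One plausible reading of the "fails in 2057" remark (an interpretation, not something stated in the original answer): the check only shifts dates that lie in the future relative to the day the code runs, so once the current year reaches 2057, a value parsed as 2057 is no longer "in the future" and is left alone. The sketch below makes the reference date a parameter to show this. The helper name shift_future_dates and its today argument are hypothetical.

```r
library(lubridate)

# Hypothetical helper: same logic as above, with "today" as a parameter
shift_future_dates <- function(dates, today = Sys.Date()) {
  future <- !is.na(dates) & year(dates) > year(today)
  if (any(future)) {
    year(dates[future]) <- year(dates[future]) - 100
  }
  dates
}

d <- as.Date(c("2057-01-01", "2001-01-01"))

# Run in 2023: 2057 is in the future, so it is moved back to 1957
run_2023 <- shift_future_dates(d, today = as.Date("2023-07-19"))

# Run in 2057: 2057 is no longer in the future, so nothing changes
run_2057 <- shift_future_dates(d, today = as.Date("2057-07-19"))
```

The result therefore depends on when the script is executed, which a fixed cutoff avoids.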

Thanks in advance. I don't have the reputation to comment on these posts and ask follow up questions and the instructions are a bit too opaque for my level of understanding.

plover

1 Answer

Try lubridate::parse_date_time2(); its cutoff_2000 argument lets you define which two-digit years are parsed as 20YY (those at or below the cutoff) rather than 19YY:

library(dplyr)
library(lubridate)
some_dates = c("1/1/63", "1/1/94", "1/1/65", "1/1/01", "1/1/86", "1/1/61", "1/1/71", "1/1/69", "1/1/86", "1/1/83", "1/1/94", "1/1/57", "1/1/79", "1/1/83", "1/1/01", "1/1/55", "1/1/77", "1/1/77", "1/1/77", "1/1/90")
twenty_later = c("1/1/84", "1/1/04", "1/1/85", "1/1/21", "1/1/06", "1/1/81", "1/1/91", "1/1/89", "1/1/06", "1/1/03", "1/1/14", "1/1/77", "1/1/99", "1/1/03", "1/1/21", "1/1/75", "1/1/97", "1/1/97", "1/1/97", "1/1/10")
df <- data.frame(some_dates, twenty_later)

df |>
  mutate(
    some_dates_clean = parse_date_time2(some_dates, "dmy", cutoff_2000 = 30),
    twenty_later_clean = parse_date_time2(twenty_later, "dmy", cutoff_2000 = 30),
    diff = twenty_later_clean - some_dates_clean
  ) |> 
  arrange(diff) |>
  head()
#>   some_dates twenty_later some_dates_clean twenty_later_clean      diff
#> 1     1/1/94       1/1/04       1994-01-01         2004-01-01 3652 days
#> 2     1/1/65       1/1/85       1965-01-01         1985-01-01 7305 days
#> 3     1/1/01       1/1/21       2001-01-01         2021-01-01 7305 days
#> 4     1/1/86       1/1/06       1986-01-01         2006-01-01 7305 days
#> 5     1/1/61       1/1/81       1961-01-01         1981-01-01 7305 days
#> 6     1/1/71       1/1/91       1971-01-01         1991-01-01 7305 days

# vs original:
df |>
  mutate(
    some_dates_clean = dmy(some_dates),
    twenty_later_clean = dmy(twenty_later),
    diff = twenty_later_clean - some_dates_clean
  ) |>
  arrange(diff) |>
  head()
#>   some_dates twenty_later some_dates_clean twenty_later_clean        diff
#> 1     1/1/65       1/1/85       2065-01-01         1985-01-01 -29220 days
#> 2     1/1/61       1/1/81       2061-01-01         1981-01-01 -29220 days
#> 3     1/1/57       1/1/77       2057-01-01         1977-01-01 -29220 days
#> 4     1/1/55       1/1/75       2055-01-01         1975-01-01 -29220 days
#> 5     1/1/63       1/1/84       2063-01-01         1984-01-01 -28855 days
#> 6     1/1/94       1/1/04       1994-01-01         2004-01-01   3652 days

Created on 2023-07-19 with reprex v2.0.2

margusl
  • Thanks for the response. Can you explain what the 30 does? I'm looking at the documentation now and it says "two digit years smaller than the cutoff", but both 57 and 94 are larger than 30, even though 94 didn't have an issue in the original code. The documentation also lists "68L", which is unclear to me. Thanks again. – plover Jul 19 '23 at 18:18
  • "[...] two-digit numbers smaller or equal to cutoff_2000 are parsed as though starting with 20 [...]", so 00..30 are parsed as 2000 ... 2030, while 57 & 94 are larger than the cutoff (here 30) and are parsed as 1957 & 1994 – margusl Jul 19 '23 at 18:25
  • OK, I think I get it: it doesn't matter what the original code did, the fix takes any two-digit year above 30 and makes it 19YY. So I can ignore the 68L (whatever that means). Thanks! – plover Jul 19 '23 at 18:37
  • In the default value, the L after 68 means it's an integer constant. 68 is the same kind of cutoff: for `YY <= 68` we get 20YY and for `YY > 68` we end up with 19YY. You can test it with `lubridate::parse_date_time2(c("1/1/68", "1/1/69"), "dmy")` – margusl Jul 19 '23 at 18:44
  • I really appreciate that explanation. Thank you so much! – plover Jul 19 '23 at 18:46
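To make the cutoff behaviour from the comments concrete, here is a small sketch that parses the boundary years with the default cutoff (68L) and with 30. The input vector is made up for illustration, and the expected years follow from the rule quoted above (values at or below the cutoff become 20YY).

```r
library(lubridate)

x <- c("1/1/30", "1/1/31", "1/1/68", "1/1/69")

# Default cutoff_2000 = 68L: 00..68 -> 20YY, 69..99 -> 19YY
default_cut <- parse_date_time2(x, "dmy")
year(default_cut)  # 2030 2031 2068 1969

# cutoff_2000 = 30: 00..30 -> 20YY, 31..99 -> 19YY
custom_cut <- parse_date_time2(x, "dmy", cutoff_2000 = 30)
year(custom_cut)   # 2030 1931 1968 1969

# The L suffix only marks an integer literal; 68L and 68 are the same number
is_same_number <- 68L == 68

# parse_date_time2() returns POSIXct; as_date() converts to Date if needed
custom_dates <- as_date(custom_cut)
```

The as_date() step is optional; it only matters if the column should hold plain dates rather than date-times.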