0

I'm trying to parse dates (using lubridate functions) from a vector which has mixed date formats.

departureDate <- c("Aug 17, 2020 12:00:00 AM", "Nov 19, 2019 12:00:00 AM", "Dec 21, 2020 12:00:00 AM",
"Dec 24, 2020 12:00:00 AM", "Dec 24, 2020 12:00:00 AM", "Apr 19, 2020 12:00:00 AM", "28/06/2019",
"16/08/2019", "04/02/2019", "10/04/2019", "28/07/2019", "26/07/2019", 
"Jun 22, 2020 12:00:00 AM", "Apr 5, 2020 12:00:00 AM",  "May 1, 2021 12:00:00 AM")

As I didn't notice at first, I tried to parse with lubridate::mdy_hms(departureDate) which resulted in NA values for dates whose format differs from that of the parser. As the format may change on random positions of the vector I tried to use the following sentence:

departureDate <- tryCatch(mdy_hms(departureDate), 
                             warning = function(w){return(dmy(departureDate))})

Which brought even more NA's as it only applied the warning function call. Is there a way to solve this by using my approach?

Thanks in advance

arlugones
  • 103
  • 7

3 Answers3

1

One method would be to iterate through a list of candidate formats and apply it only to dates not previously parsed correctly.

fmts <- c("%b %d, %Y %H:%M:%S %p", "%d/%m/%Y")
dates <- rep(Sys.time()[NA], length(departureDate))
for (fmt in fmts) {
  isna <- is.na(dates)
  if (!any(isna)) break
  dates[isna] <- as.POSIXct(departureDate[isna], format = fmt)
}
dates
#  [1] "2020-08-17 12:00:00 PDT" "2019-11-19 12:00:00 PST" "2020-12-21 12:00:00 PST"
#  [4] "2020-12-24 12:00:00 PST" "2020-12-24 12:00:00 PST" "2020-04-19 12:00:00 PDT"
#  [7] "2019-06-28 00:00:00 PDT" "2019-08-16 00:00:00 PDT" "2019-02-04 00:00:00 PST"
# [10] "2019-04-10 00:00:00 PDT" "2019-07-28 00:00:00 PDT" "2019-07-26 00:00:00 PDT"
# [13] "2020-06-22 12:00:00 PDT" "2020-04-05 12:00:00 PDT" "2021-05-01 12:00:00 PDT"
as.Date(dates)
#  [1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28"
#  [8] "2019-08-16" "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05"
# [15] "2021-05-01"

I encourage you to put the most-likely formats first in the fmts vector.

The way this is set up, as soon as all elements are correctly found, no further formats are attempted (i.e., break).


Edit: if there is a difference in LOCALE where AM/PM are not locally recognized, then one method would be to first remove them from the strings:

departureDate <- gsub("\\s[AP]M$", "", departureDate)
departureDate
#  [1] "Aug 17, 2020 12:00:00" "Nov 19, 2019 12:00:00" "Dec 21, 2020 12:00:00"
#  [4] "Dec 24, 2020 12:00:00" "Dec 24, 2020 12:00:00" "Apr 19, 2020 12:00:00"
#  [7] "28/06/2019"            "16/08/2019"            "04/02/2019"           
# [10] "10/04/2019"            "28/07/2019"            "26/07/2019"           
# [13] "Jun 22, 2020 12:00:00" "Apr 5, 2020 12:00:00"  "May 1, 2021 12:00:00" 

and then use a simpler format:

fmts <- c("%b %d, %Y %H:%M:%S", "%d/%m/%Y")
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Can I use lubridate functions instead? – arlugones Mar 18 '20 at 22:22
  • Sure. Replace `as.POSIXct` with `lubridate::as_datetime` and it works exactly the same (albeit a little faster). – r2evans Mar 18 '20 at 22:26
  • Thing is, I need to parse the dates, can't afford to get `NA`'s on valid dates – arlugones Mar 18 '20 at 22:31
  • Oh I see ... if it's a date, you want a `Date`; if it's a timestamp, you want `POSIXct`, is that it? – r2evans Mar 18 '20 at 22:32
  • Nope... I only need dates as `Date`. Tried your latter suggestions... Didn't work with `lubridate` – arlugones Mar 18 '20 at 22:36
  • How about ending with `as.Date(dates)`? Other than the class being more than you need (fixable with `as.Date`), is there something else not working? – r2evans Mar 18 '20 at 22:40
  • Okay, you must be having a "locale" issue, where `%p` is not matching correctly. Since it isn't critical, we can remove it before parsing ... – r2evans Mar 18 '20 at 22:50
  • If the issue is really locale, then perhaps the abbreviated month names are a problem as well. What does `Sys.getlocale("LC_TIME")` return for you? (For me, it's `"English_United States.1252"`, which might be why my system is producing all valid timestamps and yours is not.) (`?strptime` specifically talks about this locale setting influencing *"names of the days and months, the AM/PM indicator (if used)..."*, of which my solution uses both.) – r2evans Mar 18 '20 at 23:31
1

The ideal situation is that the code should be able to deal with every format on its own, without letting it fall to an exception.

Another issue to take into account is that the myd_hms() function returns dates in the POSIXct data type, whereas dmy() returns the Date type, so they wouldn't mix well together.

The code below applies mdy_hms(), then converts it to Date. It then tests for NA's and applies the second function dmy() on the missing values. More rules can be added in the pipeline at will if more formats are to be recognized.

library(dplyr)

dates.converted <- 
  mdy_hms(departureDate, tz = ) %>% 
  as.Date() %>%
  ifelse(!is.na(.), ., dmy(departureDate)) %>%
  structure(class = "Date")

print(dates.converted)

Output

 [1] "2020-08-17" "2019-11-19" "2020-12-21" "2020-12-24" "2020-12-24" "2020-04-19" "2019-06-28" "2019-08-16"
 [9] "2019-02-04" "2019-04-10" "2019-07-28" "2019-07-26" "2020-06-22" "2020-04-05" "2021-05-01"
user2332849
  • 1,421
  • 1
  • 9
  • 12
1

We can use lubridate::parse_date_time which can take multiple formats.

lubridate::parse_date_time(departureDate, c('%b %d, %Y %I:%M:%S %p', '%d/%m/%Y'))

#[1] "2020-08-17 UTC" "2019-11-19 UTC" "2020-12-21 UTC" "2020-12-24 UTC"
#[5] "2020-12-24 UTC" "2020-04-19 UTC" "2019-06-28 UTC" "2019-08-16 UTC"
#[9] "2019-02-04 UTC" "2019-04-10 UTC" "2019-07-28 UTC" "2019-07-26 UTC"
#[13] "2020-06-22 UTC" "2020-04-05 UTC" "2021-05-01 UTC"

Since in departureDate month name is in English, you need the locale to be English as well.

Refer How to change the locale of R? if you have non-English locale.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • I suspect the user will have problems with this, too, if they still have locale issues (see the comments in my answer), since this is using the same `%b` and `%p` locale-specific parameters. (Or maybe not.) – r2evans Mar 19 '20 at 04:20
  • Yes, if OP has non-English locale this will not work. Updated the answer to mention that. – Ronak Shah Mar 19 '20 at 04:27