Why can't the R package lubridate parse a vector with multiple formats?

Question

I'm using package lubridate to parse a vector of heterogeneously-formatted dates and convert them to string, like this:

parse_date_time(c('12/17/1996 04:00:00 PM','4/18/1950 0130'), c('%m/%d/%Y %I:%M:%S %p','%m/%d/%Y %H%M'))

This is the result:

[1] NA NA
Warning message:
All formats failed to parse. No formats found.

If I remove the %p in the 1st format string, it incorrectly parses the 1st date string, and still doesn't parse the 2nd, like so:

[1] "1996-12-17 04:00:00 UTC" NA                       
Warning message:
 1 failed to parse.

The 4PM time in the string is parsed to 4AM in the result.

Has anyone experienced this strange behavior?

I am able to replicate the error. `parse_date_time(x = mydates, orders = c('m/d/Y I:M:S p','m/d/Y HM'), locale = "eng")` gives the correct value for the first but not second date/time. `parse_date_time(mydates1, orders = c('%m/%d/%Y %H%M'))` doesn't work but.... `strptime(mydates1, format="%m/%d/%Y %H%M")` does work though... when `mydates1` is just the second date `4/18/1950 0130` — jalapic, May 20 '15 at 01:48
I believe the issue is with the `0130` of the second string. If you change it to `4/18/1950 01:30`, I believe things will work as expected. — JasonAizkalns, May 20 '15 at 01:48
@JasonAizkalns '0130' can be parsed alone: `parse_date_time('0130', '%H%M')` gives `"0-01-01 01:30:00 UTC"`. — , May 20 '15 at 01:55
@Pascal good catch, actually, it looks like that lack of a leading zero on the `4` from `4/18/1950` is the issue. `parse_date_time("4/18/1950 0130", "%m%d%Y %H%M")` fails, but `parse_date_time("04/18/1950 0130", "%m%d%Y %H%M")` works. — JasonAizkalns, May 20 '15 at 02:08
I would recommend to contact the maintainer (`maintainer("lubridate")`), as, to complete @JasonAizkalns comment, `parse_date_time("4/18/1950", "%m%d%Y")` gives `"1950-04-18 UTC"`. — , May 20 '15 at 02:11
@Pascal I've added [this issue](https://github.com/hadley/lubridate/issues/326) to the GitHub repo. — JasonAizkalns, May 20 '15 at 02:25
Thanks, everyone. Before posting here I reported this to lubridate's GitHub, and already heard back. They've created issue [327](https://github.com/hadley/lubridate/issues/327) for this. I also noticed that on MacOSX Yosemite the `%p` actually works, interpreting 04:00:00 PM as 1600 hours. — Jesus Ramos, May 21 '15 at 02:27

score 1 · Answer 1 · edited May 23 '17 at 12:08

This is probably related to your system locale.

parse_date_time {lubridate}

p : AM/PM indicator in the locale. Used in conjunction with I and not with H. An empty string in some locales.

Because different languages have different strings for AM/PM, if your locale is not English, lubridate will not pick up the AM/PM indicator even if you specify it.

The OS locale can cover the display language, time format, and time zone. I'm using English Windows with a US time zone and a Chinese locale, so I had been fighting with AM/PM in time parsing too.

Sys.getlocale("LC_TIME")
[1] "Chinese (Simplified)_China.936"

You can specify the locale in parse_date_time {lubridate}, but changing the system locale didn't work for me at first:

Sys.setlocale("LC_TIME", "en_US") 
[1] ""
Warning message:
In Sys.setlocale("LC_TIME", "en_US") :
  OS reports request to set locale to "en_US" cannot be honored

locales {base}

The locale describes aspects of the internationalization of a program. Initially most aspects of the locale of R are set to "C" (which is the default for the C language and reflects North-American usage). See strptime for uses of category = "LC_TIME".

Then I found this and used it successfully:

Sys.setlocale("LC_TIME", "C")
[1] "C"

After this the parsing works:

parse_date_time('12/17/1996 04:00:00 PM', '%m/%d/%Y %I:%M:%S %p')
[1] "1996-12-17 16:00:00 UTC"

You can also specify the time zone and locale:

parse_date_time('12/17/1996 04:00:00 PM', '%m/%d/%Y %I:%M:%S %p', tz = "America/New_York", locale = "C")
[1] "1996-12-17 16:00:00 EST"
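If changing the locale for the whole session is undesirable, the switch can be scoped to a single call. The following is a minimal sketch in base R, not something lubridate provides: `with_c_time_locale` is a hypothetical helper name, and base `strptime` stands in for the parser to show the effect of the "C" locale on `%p`.

```r
# Hypothetical helper: evaluate an expression with LC_TIME set to "C",
# then restore the previous locale even if the expression fails.
with_c_time_locale <- function(expr) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  force(expr)  # the argument is lazy, so it is evaluated under the "C" locale
}

parsed <- with_c_time_locale(
  strptime("12/17/1996 04:00:00 PM", "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
)
format(parsed, "%Y-%m-%d %H:%M:%S")
```

Because the expression is only evaluated inside the helper, the same wrapper should also work around a parse_date_time call.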

You are right that locales are the issue with AM/PM. There is a dedicated [issue](https://github.com/hadley/lubridate/issues/327) open for this. Note that OP's problem is not only about PM/AM. — VitoshKa, Oct 31 '15 at 15:02

score 1 · Answer 2 · answered Oct 31 '15 at 15:40

The problem with the %p part is locale-related. See this issue.

The inability to parse has to do with the way lubridate's guesser works.

There are two ways lubridate infers formats: flex and exact. With flex matching, all numeric elements can have flexible length (for example, both 4 and 04 for day will work), but there must be non-numeric separators between the elements. For the exact matcher, non-numeric separators are not needed, but elements must have an exact number of digits (like 04).

Unfortunately you cannot combine both matchers within one expression. It would be extremely hard to fix this and preserve the current flexibility of the lubridate parser.

In your example

> parse_date_time('4/18/1950 0130', 'mdY HM')
[1] NA
Warning message:
All formats failed to parse. No formats found.

you want to perform flex matching on the date part, 4/18/1950, and exact matching on the time part, 0130.

Please note that if your date-time is in a fully flex or fully exact format, the parsing will work as expected:

> parse_date_time('04/18/1950 0130', 'mdY HM')
[1] "1950-04-18 01:30:00 UTC"
> parse_date_time('4/18/1950 1:30', 'mdY HM')
[1] "1950-04-18 01:30:00 UTC"
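One way to get a mixed string into the fully flex form is to insert the missing separator before parsing. The sketch below uses only base R; `add_time_colon` is a hypothetical helper name, and it assumes the compact time is always a four-digit HHMM group at the very end of the string.

```r
# Hypothetical helper: turn a trailing " HHMM" into " HH:MM" so that every
# numeric element is separated by a non-numeric character.
add_time_colon <- function(x) {
  sub(" ([0-9]{2})([0-9]{2})$", " \\1:\\2", x)
}

normalized <- add_time_colon(c("12/17/1996 04:00:00 PM", "4/18/1950 0130"))
normalized
```

Strings that do not end in a bare four-digit group are left untouched, so the helper can be applied to the whole vector before calling the parser.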

lubridate 1.4.1 "fixes" this by adding a new argument to parse_date_time, exact = FALSE. When set to TRUE, the orders argument is interpreted as containing exact strptime formats, and no guessing or training is performed. This way you can supply as many exact formats as you want, and you also gain speed because no guessing is performed at all.

> parse_date_time(c('12/17/1996 04:00:00','4/18/1950 0130'),
+                 c('%m/%d/%Y %I:%M:%S','%m/%d/%Y %H%M'),
+                 exact = TRUE)
[1] "1996-12-17 04:00:00 UTC" "1950-04-18 01:30:00 UTC"

Relatedly, there was an explicit request for such an option.
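For readers on a lubridate version without the exact argument, the idea of trying several exact formats in order can be approximated in base R. This is only a sketch under stated assumptions: `parse_multi` is a hypothetical helper, not a lubridate function, it relies on base `strptime`, and the example avoids `%p` so that the result does not depend on the locale.

```r
# Hypothetical helper: try each exact strptime format in turn and keep the
# first successful parse for every element of x.
parse_multi <- function(x, formats, tz = "UTC") {
  # Start with an all-NA POSIXct vector of the right length.
  out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = tz)
  for (fmt in formats) {
    todo <- is.na(out)
    if (!any(todo)) break
    out[todo] <- as.POSIXct(strptime(x[todo], format = fmt, tz = tz))
  }
  out
}

res <- parse_multi(
  c("12/17/1996 04:00:00", "4/18/1950 0130"),
  c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %H%M")
)
format(res, "%Y-%m-%d %H:%M:%S")
```

Formats are tried in the order given, so more specific formats should come first; elements that match none of them stay NA.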
