I would like a generic, fast parser for dates that arrive in arbitrary formats, such as:

  • 2018
  • 2018-12-31
  • 2018/12/31
  • 2018 dec 31
  • 20181231151617
  • 2018-12-31T15:16:17
  • 2018-12-31T15:16:17.123456
  • 2018-12-31T15:16:17.123456Z
  • 2018-12-31T15:16:17.123456 UTC
  • 2018-12-31T15:16:17.123456+01:00
  • ... so many possibilities

Is there a nice way or a "magic" function to do that?

Currently I am planning to use something like this:

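import java.time.{Instant, LocalDateTime, ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoField

// Each bracketed pattern is an optional section; fields missing from the input fall back to the defaults below
val formatter = new DateTimeFormatterBuilder()
  .appendPattern("[yyyy-MM-dd'T'HH:mm:ss]")
  .appendPattern("[yyyy-MM-dd]")
  .appendPattern("[yyyy]")
  // add so many things here
  .parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
  .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
  .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
  .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
  .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
  .parseDefaulting(ChronoField.MICRO_OF_SECOND, 0)
  .toFormatter()

val temporalAccessor = formatter.parse("2018")
val localDateTime = LocalDateTime.from(temporalAccessor)
localDateTime.getHour // defaulted, because the input has no time part
// Interpret the local date-time in the JVM's default time zone
val zonedDateTime = ZonedDateTime.of(localDateTime, ZoneId.systemDefault)
val result = Instant.from(zonedDateTime)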

But is there a smarter way than specifying hundreds of formats?

Most of the answers I found are outdated (pre-Java 8) or do not address performance and a large number of different formats.

Yuval Itzchakov
Benjamin
  • By "smarter way", do you mean using some preconfigured date formats, so that one would not need to specify the different date formats in advance? – Alexey Novakov Feb 03 '19 at 19:58
  • Yes, as many as possible – Benjamin Feb 03 '19 at 20:05
  • I would look for some libraries in the area of NLP. Not sure if you have seen this answer already: https://stackoverflow.com/a/21164291/6176274 – Alexey Novakov Feb 03 '19 at 20:12
  • At first glance Natty looks good for parsing a single date, but I am not sure this NLP library can deal with millions of dates in a few seconds. Moreover, the project has been stalled for 3 years. I will test it anyway. TY @AlexeyNovakov – Benjamin Feb 03 '19 at 20:35
  • Akin to [How to parse dates in multiple formats using SimpleDateFormat](https://stackoverflow.com/questions/4024544/how-to-parse-dates-in-multiple-formats-using-simpledateformat) and a number of other questions. Please put more effort into your search (and no, you are correct in not using `SimpleDateFormat`; [my answer here](https://stackoverflow.com/a/45315872/5772882) might be a starting point?) – Ole V.V. Feb 04 '19 at 09:24
  • You have basically found the smart way. There are other useable ways, but there isn’t anything that is doubtless smarter. – Ole V.V. Feb 04 '19 at 09:28
  • When parsing into a `LocalDateTime` you seem to be losing the vital offset information. Your last two examples denote different points in time but will parse into equal `LocalDateTime` objects. I don’t think you should want that. – Ole V.V. Feb 04 '19 at 09:40

1 Answer

No, there is no nice/magic way to do this, for two main reasons:

  1. There are variations and ambiguities in date formats that make a generic parser very difficult, e.g. 11/11/11.

  2. You are looking for very high performance, which rules out any brute-force methods. 1 µs per date leaves room for only a few thousand instructions to do the full parsing.
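To make the first point concrete, here is a minimal sketch in which one string parses successfully under three plausible conventions and yields three different dates. The input `01/02/03` and the object name `AmbiguityDemo` are made up for illustration; only the `java.time` calls are real API.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object AmbiguityDemo {
  // Hypothetical input: nothing in the text says which field comes first.
  val text = "01/02/03"

  // Two-digit years ("yy") are resolved relative to the year 2000.
  val dayFirst: LocalDate   = LocalDate.parse(text, DateTimeFormatter.ofPattern("dd/MM/yy")) // 2003-02-01
  val monthFirst: LocalDate = LocalDate.parse(text, DateTimeFormatter.ofPattern("MM/dd/yy")) // 2003-01-02
  val yearFirst: LocalDate  = LocalDate.parse(text, DateTimeFormatter.ofPattern("yy/MM/dd")) // 2001-02-03

  def main(args: Array[String]): Unit =
    println(s"$dayFirst / $monthFirst / $yearFirst")
}
```

None of the three parses fails, so a generic parser has no way to choose between them without being told which convention applies.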

At some level you are going to have to specify what formats are valid and how to interpret them. The best way to do this is probably one or more regular expressions that extract the appropriate fields from all the allowable combinations of characters that might form a date, and then much simpler validation of the individual fields.

Here is an example that deals with all the dates you listed:

val DateMatch = """(\d\d\d\d)[-/ ]?((?:\d\d)|(?:\w\w\w))?[-/ ]?(\d\d)?T?(\d\d)?:?(\d\d)?:?(\d\d)?[\.]*(\d+)?(.*)?""".r

// `date` is the input string; the pattern must match it in full
date match {
  // Groups that did not take part in the match are null, so wrap each in Option and default it (as a String)
  case DateMatch(year, month, day, hour, min, sec, usec, timezone) =>
    (year, Option(month).getOrElse("1"), Option(day).getOrElse("1"), Option(hour).getOrElse("0"),
      Option(min).getOrElse("0"), Option(sec).getOrElse("0"), Option(usec).getOrElse("0"), Option(timezone).getOrElse(""))
  case _ =>
    throw new IllegalArgumentException(s"Invalid date: $date")
}

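As you can see, it is going to get very hairy once all the possible formats are included. But if the regex engine can handle it, it should be efficient: ideally the regex compiles to a state machine that looks at each character once. (Scala's Regex wraps java.util.regex, which is a backtracking matcher, so this holds only as long as the pattern leaves little room for backtracking.)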
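The "much simpler validation of the individual fields" could look roughly like the sketch below, which turns the eight string fields captured by the regex into an `Instant`. Everything here is an assumption made for illustration: the object and method names (`FieldValidation`, `toInstant`), English three-letter month abbreviations, treating the digits after the dot as a fraction of a second, and reading a missing zone as UTC rather than the system default, so that inputs with different offsets stay distinct.

```scala
import java.time.{Instant, LocalDateTime, ZoneOffset}
import scala.util.Try

object FieldValidation {
  // Assumption: month names are English three-letter abbreviations.
  private val MonthNames: Map[String, Int] =
    Seq("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")
      .zipWithIndex.map { case (name, i) => name -> (i + 1) }.toMap

  // Accept either a known abbreviation or a number; anything else throws.
  private def month(s: String): Int =
    MonthNames.getOrElse(s.toLowerCase, s.toInt)

  // "123456" means 0.123456 s, so right-pad to nine digits to get nanoseconds.
  private def nanos(s: String): Int =
    s.take(9).padTo(9, '0').toInt

  // Assumption: empty, "Z" and "UTC" all mean UTC; otherwise expect an offset such as "+01:00".
  private def offset(s: String): ZoneOffset = s.trim.toUpperCase match {
    case "" | "Z" | "UTC" => ZoneOffset.UTC
    case other            => ZoneOffset.of(other)
  }

  // Returns None when any field is out of range or malformed (e.g. 30 February).
  def toInstant(year: String, mon: String, day: String, hour: String,
                min: String, sec: String, frac: String, zone: String): Option[Instant] =
    Try {
      LocalDateTime
        .of(year.toInt, month(mon), day.toInt, hour.toInt, min.toInt, sec.toInt, nanos(frac))
        .toInstant(offset(zone))
    }.toOption
}
```

For example, `FieldValidation.toInstant("2018", "dec", "31", "15", "16", "17", "123456", "+01:00")` gives the instant `2018-12-31T14:16:17.123456Z`. Range checks are delegated to `LocalDateTime.of`, which rejects impossible dates, so the regex itself can stay permissive.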

Tim