
I want to parse dates in Java, but the format is not known in advance and could be one of many (any ISO-8601 format, which is already a lot; a Unix timestamp in any unit; and more). Here are some samples:

  • 1970-01-01T00:00:00.00Z
  • 1234567890
  • 1234567890000
  • 1234567890000000
  • 2021-09-20T17:27:00.000Z+02:00

The perfect parsing might be impossible because of ambiguous cases, but a solution to parse most of the common dates with some logic might be achievable (for example, timestamps could be interpreted as seconds, milliseconds, microseconds or nanoseconds, whichever gives a date close to the 2000 era; dates like '08/07/2021' could have a default for the month/day distinction). I didn't find any easy way to do this in Java, while in Python it is more or less possible (it does not work on all my samples, but on at least some of them) using infer_datetime_format of the pandas function to_datetime (https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html).

Is there an easy approach in Java?

bloub
  • Related: (1) [More beautiful Multiple DateTimeFormatter?](https://stackoverflow.com/questions/67334271/more-beautiful-multiple-datetimeformatter) (2) [Parse any date in Java](https://stackoverflow.com/questions/3389348/parse-any-date-in-java). There are more. – Ole V.V. Sep 28 '21 at 17:22

4 Answers


Well, first of all, I agree with rzwitserloot here that date parsing in free format is extremely difficult and full of ambiguities. So you are skating on thin ice and will eventually run into trouble if you just assume that a user input will be correctly parsed the way you think it will.

Nevertheless, we could make it work if I assume any of the following:

  • You simply don't care if it will be parsed incorrectly; or

  • You are doing this for fun or for learning purposes; or

  • You have a banner, saying:

    If the parsing goes wrong, it's your fault. Don't blame us.

Anyway, DateTimeFormatterBuilder can build a DateTimeFormatter that parses many different patterns. Since a formatter supports optional sections, it can be instructed to try to parse a certain value and to skip that part if no valid value is found.

For instance, this builder is able to parse a fairly wide range of ISO-like dates, with many optional parts:

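import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Locale;

DateTimeFormatterBuilder builder = new DateTimeFormatterBuilder()
    // The date itself is mandatory; month and day may have one or two digits.
    .appendPattern("uuuu-M-d")
    // Everything after the date (time and offset) is optional.
    .optionalStart()
        // The date/time separator may be a space, a 'T', or absent.
        .optionalStart().appendLiteral(' ').optionalEnd()
        .optionalStart().appendLiteral('T').optionalEnd()
        .appendValue(ChronoField.HOUR_OF_DAY)
        // Minutes, seconds and fraction are nested: each needs the previous one.
        .optionalStart()
            .appendLiteral(':')
            .appendValue(ChronoField.MINUTE_OF_HOUR)
            .optionalStart()
                .appendLiteral(':')
                .appendValue(ChronoField.SECOND_OF_MINUTE)
                .optionalStart()
                    .appendFraction(ChronoField.NANO_OF_SECOND, 1, 9, true)
                .optionalEnd()
            .optionalEnd()
        .optionalEnd()
        // Optional zone offset in any of the 'X' styles (Z, +02, +0200, +02:00, ...).
        .appendPattern("[XXXXX][XXXX][XXX][XX][X]")
    .optionalEnd();
DateTimeFormatter formatter = builder.toFormatter(Locale.ROOT);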

All of the strings below can be successfully parsed by this formatter.

// Requires: import java.util.stream.Stream;
Stream.of(
    "2021-09-28",
    "2021-07-04T14",
    "2021-07-04T14:06",
    "2001-09-11 00:00:15",
    "1970-01-01T00:00:15.446-08:00",
    "2021-07-04T14:06:15.2017323Z",
    "2021-09-20T17:27:00.000+02:00"
).forEach(testcase -> System.out.println(formatter.parse(testcase)));

As you can see, optionalStart() and optionalEnd() let you define optional portions of the format.

There are many more patterns you probably want to parse. You could add those patterns to the above-mentioned builder. Alternatively, the appendOptional(DateTimeFormatter) method could be used to include multiple formatters.
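
As a rough illustration of the appendOptional route, here is a minimal sketch that chains three date formats into one formatter. The class name OptionalFormats, the chosen patterns and the day-first reading of slash-separated dates are my own assumptions for the example, not a recommendation.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class OptionalFormats {

    // Each appendOptional(...) section is tried at the current position;
    // a section that does not match is skipped without consuming input.
    static final DateTimeFormatter ANY_DATE = new DateTimeFormatterBuilder()
        .appendOptional(DateTimeFormatter.ISO_LOCAL_DATE)         // 2021-09-28
        .appendOptional(DateTimeFormatter.ofPattern("d/M/uuuu"))  // 08/07/2021, day first (assumed default)
        .appendOptional(DateTimeFormatter.ofPattern("d.M.uuuu"))  // 28.09.2021
        .toFormatter(Locale.ROOT);

    // Throws DateTimeParseException if none of the sections matches the whole text.
    static LocalDate parse(String text) {
        return LocalDate.parse(text, ANY_DATE);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"2021-09-28", "08/07/2021", "28.09.2021"}) {
            System.out.println(s + " -> " + parse(s));
        }
    }
}
```

Note that the day-first choice for "08/07/2021" is baked into the pattern; a different default would need a different pattern.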

MC Emperor

I don't know of any standard library with this functionality, but you can always use the DateTimeFormatter class and guess the format by looping over a list of predefined formats, or by using the ones provided by this class.

This is a typical approximation of what you want to achieve.

Here you can see an old implementation: https://balusc.omnifaces.org/2007/09/dateutil.html
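
A minimal sketch of the loop over predefined formats could look like the following. The class name FormatGuesser, the candidate list and its order are assumptions made up for this example; a real list would be longer and ordered by how common each format is in your data.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper class, for illustration only.
final class FormatGuesser {

    // Candidate formats, tried in order; the first one that parses the whole text wins.
    private static final List<DateTimeFormatter> CANDIDATES = List.of(
        DateTimeFormatter.ISO_OFFSET_DATE_TIME,
        DateTimeFormatter.ISO_LOCAL_DATE_TIME,
        DateTimeFormatter.ISO_LOCAL_DATE,
        DateTimeFormatter.ofPattern("dd/MM/uuuu", Locale.ROOT)); // day first (assumed default)

    // Returns the richest type the matching format supports, or empty if nothing matches.
    static Optional<TemporalAccessor> tryParse(String text) {
        for (DateTimeFormatter candidate : CANDIDATES) {
            try {
                return Optional.of(candidate.parseBest(
                    text, OffsetDateTime::from, LocalDateTime::from, LocalDate::from));
            } catch (DateTimeParseException e) {
                // This candidate does not fit; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1970-01-01T00:00:00.00Z", "2021-07-04T14:06:15", "08/07/2021", "1234567890"}) {
            System.out.println(s + " -> " + tryParse(s));
        }
    }
}
```

Digits-only input such as 1234567890 matches none of these candidates and comes back empty, which is one of the unavoidable edge cases.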

Javier Toja
  • My current solution is to loop over a list of predefined formats that are stored inside a database (and ordered from more common to less common). The list in the database could then be updated with specific header usage through an API, but I want to remove or limit this part as much as possible. – bloub Sep 28 '21 at 11:25
  • I would keep these formats in a class or enum rather than in a database. Java date formats support some generic characters which allow you to keep only a subset and still get good matching, but keep in mind the edge cases that are unavoidable in some very rare formats, like digits-only input that can be interpreted in various ways. – Javier Toja Sep 28 '21 at 11:27
  • Check this link https://balusc.omnifaces.org/2007/09/dateutil.html @bloub – Javier Toja Sep 28 '21 at 11:30

The perfect parsing might be impossible because of ambiguous cases, but a solution to parse most of the common dates with some logic might be achievable

Sure, and such wide-ranging guesswork should most definitely not be part of a standard java.* API. I think you're also wildly underestimating the ambiguity. 1234567890? It's just flat out incorrect to say that this can reasonably be parsed.

You are running into many, many problems here:

  • Java in general prefers throwing an error over guessing. This is inherent in the language: Java has few optional syntax constructs; semicolons aren't optional, () for method invocations are not optional, and Java intentionally does not have 'truthy/falsy', i.e. if (foo) is only valid if foo is an expression of the boolean type, unlike e.g. Python, where you can stick anything in there and there's a big list of what counts as falsy, with the rest being considered truthy. When in Rome, do as the Romans do: if this tenet annoys you, well, either learn to love it, begrudgingly accept it, or program in another language. This idea is endemic in the entire ecosystem. For what it is worth, given that debugging tends to take far longer than typing the optional constructs, Java is objectively correct, or at least making rational decisions, for being like this.

  • Either you can't bring in the notion that 'hey, this number is larger than 12, therefore it cannot possibly be the month', or you have to accept that whether a certain date format parses properly depends on whether the day-of-month value is above or below 12. I would strongly advocate that you avoid a library that fails this rule like the plague. What possible point is there, in the end? "My app will parse your date correctly, but only for about 3/5ths of all dates?" So, given that you can't/should not take that into account, 1234567890, is that seconds-since-1970? Milliseconds-since-1970? Is that the 12th of the 34th month of the year 5678, the 90th hour, and assumed zeroes for minutes, seconds, and millis? If a library guesses, that library is wrong, because you should not guess unless you're 95%+ sure.

  • The obvious and perennial "do not guess" example is, of course, 101112. Is that November 10th, 2012 (European style)? Is that October 11th, 2012 (American style)? Or is that November 12th, 2010 (ISO style)? These are all reasonable guesses and therefore guessing is just wrong here. Do. Not. Guess. Unless you're really sure. Given that this is a somewhat common way to enter dates: guessing at all costs is objectively silly (see above), and guessing only when it's pretty clear and erroring out otherwise is mostly useless, given that ambiguity is so easy to introduce.

  • The concept of guessing may be defensible, but only with a lot more information. For example, if you give me the input '101112100000', there's no way it's correct to guess here. But if you also tell me that a human entered this input, and that human is clearly clued into, say, the German locale, then I can see the need to be able to turn that into '10th of November 2012, 10 o'clock in the morning': interpreting it as seconds or millis since some epoch is precluded by the human factor, and the day-month-year order is given by the locale.
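
To make the 101112 example concrete, here is a small sketch in which the caller has to state the field order explicitly. The class SixDigitDates and the FieldOrder enum are hypothetical names invented for this illustration; the point is only that the same six digits give three different dates depending on information the string itself does not carry.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class SixDigitDates {

    // Extra information the caller must supply; the digits alone cannot reveal it.
    enum FieldOrder {
        DAY_MONTH_YEAR("ddMMuu"),
        MONTH_DAY_YEAR("MMdduu"),
        YEAR_MONTH_DAY("uuMMdd");

        final DateTimeFormatter formatter;

        FieldOrder(String pattern) {
            // A two-letter "uu" year is read relative to 2000, i.e. 2000 to 2099.
            this.formatter = DateTimeFormatter.ofPattern(pattern, Locale.ROOT);
        }
    }

    static LocalDate parse(String digits, FieldOrder order) {
        return LocalDate.parse(digits, order.formatter);
    }

    public static void main(String[] args) {
        for (FieldOrder order : FieldOrder.values()) {
            System.out.println(order + " -> " + parse("101112", order));
        }
    }
}
```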

You asked:

Is there an easy approach in Java?

This entire question is incorrect. The in Java part needs to be stripped from this question, and then the answer is a simple: No. There is no simple way to parse strings into date/times without a lot more information than just the input string. If another library says they can do that, they are lying, or at least, operating under a list of cultural and source assumptions as long as my leg, and you should not be using that library.

rzwitserloot
  • I have your point and it's true that it's not a Java-specific problem, but as I encountered it with Java it's why I asked this question. For any library claiming to do that I don't think "operating under a list of cultural and source assumptions" is true. I mean, of course, defaults are inevitable for ambiguous cases but they could be specified and configurable. – bloub Sep 28 '21 at 11:31
  • @bloub pandas certainly is. Instead of providing a locale, you get a boolean flag for 'is the date first or the month first', which is myopic; it treats the world as consisting of non-programming older humans in most of mainland europe and the US, ignoring just about everybody else. Actual reasonable guesses as to what an entered date might mean are quite a bit more involved than 'if I have 2 numbers, should I assume the day-of-month is first, or the month-of-year is first?'. Accountants often use week numbers, standards bodies (and thus, people working with them) put year first. – rzwitserloot Sep 28 '21 at 11:42
  • Thus, pandas' existence is anecdotal evidence that libraries like pandas's to_datetime shouldn't exist in the first place. – rzwitserloot Sep 28 '21 at 11:43
  • This answer is 100 % correct. I was thinking that if we could limit the range, we may be able to distinguish a number of different time units. But since 1970 and 2021 are both within the expected range, 1234567890 could easily be nanoseconds (1970-01-01T00:00:01.234567890Z), microseconds (1970-01-01T00:20:34.567890Z), milliseconds (1970-01-15T06:56:07.890Z) or seconds (2009-02-13T23:31:30Z), and we have no way of telling (they are hardly minutes, though (4317-04-25T19:30:00Z)). – Ole V.V. Sep 28 '21 at 17:18
  • @OleV.V. Indeed. But, I wouldn't be so quick to exclude 4317 and friends. If you exclude anything on the basis that the date would be 'out of range', it means that boundaries exist: 1234567890 would parse correctly, but e.g. 1000007890 all of a sudden parses incorrectly because it crossed an arbitrary boundary and now the preferred parse strategy is no longer discarded due to the number being 'out of range'. This doesn't sound like a property anybody would ever want in a programming library call. – rzwitserloot Sep 28 '21 at 22:28
  • I’m generally in favour of documented range checks, also when the exact limits of my range are necessarily somewhat arbitrary. Whether year 4317 should be considered in or out of range should depend on analysed business requirements, not on what some random Stacker like me happened to write in a comment here. – Ole V.V. Sep 29 '21 at 03:26
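
The unit ambiguity discussed in these comments can be reproduced with a short sketch. It lists every reading of a number as nanoseconds, microseconds, milliseconds or seconds since the epoch, and then keeps only the readings that fall inside a range the caller documents. The class name EpochGuess and the 2000 to 2100 range used in main are assumptions for the example; where the limits belong is a business decision, as noted above.

```java
import java.time.DateTimeException;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class EpochGuess {

    private static final ChronoUnit[] UNITS = {
        ChronoUnit.NANOS, ChronoUnit.MICROS, ChronoUnit.MILLIS, ChronoUnit.SECONDS
    };

    // Every reading of the value as a count of some unit since 1970-01-01T00:00:00Z.
    static Map<ChronoUnit, Instant> interpretations(long value) {
        Map<ChronoUnit, Instant> result = new EnumMap<>(ChronoUnit.class);
        for (ChronoUnit unit : UNITS) {
            try {
                result.put(unit, Instant.EPOCH.plus(value, unit));
            } catch (DateTimeException | ArithmeticException e) {
                // Not representable as an Instant in this unit; skip it.
            }
        }
        return result;
    }

    // Keeps only the readings inside a caller-chosen range [minInclusive, maxExclusive).
    static Map<ChronoUnit, Instant> plausible(long value, Instant minInclusive, Instant maxExclusive) {
        Map<ChronoUnit, Instant> result = interpretations(value);
        result.values().removeIf(i -> i.isBefore(minInclusive) || !i.isBefore(maxExclusive));
        return result;
    }

    public static void main(String[] args) {
        Instant min = Instant.parse("2000-01-01T00:00:00Z"); // assumed lower bound
        Instant max = Instant.parse("2100-01-01T00:00:00Z"); // assumed upper bound
        for (long sample : new long[] {1234567890L, 1234567890000L, 1234567890000000L}) {
            System.out.println(sample + " all:       " + interpretations(sample));
            System.out.println(sample + " plausible: " + plausible(sample, min, max));
        }
    }
}
```

With a lower bound of 1970 instead of 2000, all four readings of 1234567890 survive the filter, which is exactly the ambiguity described above. A narrower range removes it only by moving the arbitrary boundary somewhere else.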

FTA (https://github.com/tsegall/fta) is designed to solve exactly this problem (among others). It currently parses thousands of formats and does not do it via a predefined set, so typically runs extremely quickly. In this example we explicitly set the DateResolutionMode, however, it will default to something intelligent based on the Locale. Here is an example:


import java.util.Locale;

import com.cobber.fta.dates.DateTimeParser;
import com.cobber.fta.dates.DateTimeParser.DateResolutionMode;

public abstract class Simple {

    public static void main(final String[] args) {
        final String[] samples =  { "1970-01-01T00:00:00.00Z", "2021-09-20T17:27:00.000Z+02:00", "08/07/2021" };

        final DateTimeParser dtp = new DateTimeParser().withDateResolutionMode(DateResolutionMode.MonthFirst).withLocale(Locale.ENGLISH);
        for (final String sample : samples)
            System.err.printf("Format is: '%s'%n", dtp.determineFormatString(sample));
    }
}

Which will give the following output:

Format is: 'yyyy-MM-dd'T'HH:mm:ss.SSX'
Format is: 'yyyy-MM-dd'T'HH:mm:ss.SSSX'
Format is: 'MM/dd/yyyy'
Tim Segall