How to parse string datetime & timezone with Arabic-Hindu digits in Java 8?

Question

I wanted to parse a datetime string with a timezone offset written in Arabic-Hindu digits, so I wrote code like this:

    String dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠";
    char zeroDigit = '٠';
    Locale locale = Locale.forLanguageTag("ar");
    DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
            .withLocale(locale)
            .withDecimalStyle(DecimalStyle.of(locale).withZeroDigit(zeroDigit));
    ZonedDateTime parsedDateTime = ZonedDateTime.parse(dateTime, pattern);
    assert parsedDateTime != null;

But I received the exception:

java.time.format.DateTimeParseException: Text '٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠' could not be parsed at index 19

I checked a lot of questions on Stack Overflow, but I still don't understand what I did wrong.

It works fine with dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+02:00" when the timezone doesn't use Arabic-Hindu digits.

@OleV.V. I mean this: https://en.wikipedia.org/wiki/Eastern_Arabic_numerals — Daria Pydorenko, Nov 08 '21 at 13:48
Like @OleV.V. mentioned, ISO formats need ASCII numerals. Hence, before you subject the string to a `DateTimeFormatter`, convert the digits using the mapping shown in this SO question - https://stackoverflow.com/questions/14834846/what-is-the-range-for-arabic-indic-digits-hindu-arabic-numeral-utf8-from-0-to. — Sree Kumar, Nov 08 '21 at 14:04
@SreeKumar Thanks. Interesting to see that your link indeed uses the term *Hindu–Arabic* about those digits. I don’t think I’ve seen or heard that before. — Ole V.V., Nov 08 '21 at 14:08

Ole V.V. · Accepted Answer · 2021-11-09T06:15:32.350

Your dateTime string is wrong, the result of a misunderstanding. It obviously tries to conform to the ISO 8601 format but fails, because ISO 8601 uses US-ASCII digits.

The classes of java.time (Instant, OffsetDateTime and ZonedDateTime) would parse your string without any formatter if only the digits were correct for ISO 8601. In the vast majority of cases I would take your avenue: try to parse the string as it is. Not in this case. To me it makes more sense to correct the string before parsing.

    import java.nio.CharBuffer;
    import java.time.OffsetDateTime;

    String dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠";
    char[] dateTimeChars = dateTime.toCharArray();
    for (int index = 0; index < dateTimeChars.length; index++) {
        // Replace every Unicode decimal digit with its US-ASCII equivalent.
        if (Character.isDigit(dateTimeChars[index])) {
            int digitValue = Character.getNumericValue(dateTimeChars[index]);
            dateTimeChars[index] = Character.forDigit(digitValue, 10);
        }
    }

    // CharBuffer implements CharSequence, so parse() accepts it directly.
    OffsetDateTime odt = OffsetDateTime.parse(CharBuffer.wrap(dateTimeChars));

    System.out.println(odt);

Output:

2021-11-08T02:21:08+02:00

Edit: It will be even better, of course, if you can educate the publisher of the string to use US-ASCII digits.

Edit: I know the Wikipedia article I link to below says:

Representations must be written in a combination of Arabic numerals and the specific computer characters (such as "-", ":", "T", "W", "Z") that are assigned specific meanings within the standard; …

This is one conceivable cause of the confusion. The article Arabic numerals that it links to says:

Arabic numerals are the ten digits: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9.

Edit: How I convert each digit: Character.getNumericValue() converts from a char representing a digit to an int equal to the number that the digit represents, so '٠' to 0, '٢' to 2, etc. It works for all characters that are digits (not only Arabic and ASCII ones). Character.forDigit() performs sort of the opposite conversion, only always to US ASCII, so 0 to '0', 2 to '2', etc.
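The per-character conversion described above can be tried in isolation. The following is only a minimal sketch: the class name `DigitConversionDemo` and the helper `toAsciiDigit` are made up for illustration and are not part of the answer's code. Characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
class DigitConversionDemo {

    /**
     * Hypothetical helper: maps any Unicode decimal digit to its US-ASCII
     * counterpart and returns every other character unchanged.
     */
    public static char toAsciiDigit(char ch) {
        if (!Character.isDigit(ch)) {
            return ch; // separators such as '-', ':', 'T', '+' pass through
        }
        int value = Character.getNumericValue(ch); // e.g. U+0662 -> 2
        return Character.forDigit(value, 10);      // e.g. 2 -> '2'
    }

    public static void main(String[] args) {
        char[] samples = {
            '\u0660', // ARABIC-INDIC DIGIT ZERO
            '\u0662', // ARABIC-INDIC DIGIT TWO
            '\u096F', // DEVANAGARI DIGIT NINE
            '7',      // already ASCII
            'T',      // not a digit
        };
        for (char ch : samples) {
            System.out.println(toAsciiDigit(ch));
        }
    }
}
```

Because `Character.isDigit()` guards the conversion, letters (for which `getNumericValue()` can also return a non-negative value) are left alone.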

Edit: Thanks to @Holger for drawing my attention to CharBuffer in this context. A CharBuffer implements CharSequence, the type that the parse methods of java.time require, so it saves us from converting the char array back to a String.

Links

In the loop where you are using the code point, if you use the ASCII to Arabic numeral mapping in this question - https://stackoverflow.com/questions/14834846/what-is-the-range-for-arabic-indic-digits-hindu-arabic-numeral-utf8-from-0-to - it will be more accurate, I think. Or have I got your code wrong? — Sree Kumar, Nov 08 '21 at 14:06
@SreeKumar we cannot tell whether you got the code wrong, as you didn’t explain why you think this code didn’t handle the digits correctly or why a Q&A not even containing code “will be more accurate” (more accurate in what regard?). — Holger, Nov 08 '21 at 14:31
@SreeKumar I am passing a `char` to both `Character.isDigit(char)` and `getNumericValue(char)`. For Arabic digits I doubt that it makes any difference. — Ole V.V., Nov 08 '21 at 14:44
Where in the question is an ISO 8601 format used? I can see that the format is essentially identical to ISO 8601, but it’s a new, localized DateTimeFormatter, not one of the ISO 8601 static fields of DateTimeFormatter. I think the question is really about whether the offset field uses localized digits. — VGR, Nov 08 '21 at 15:23
@VGR It’s my interpretation. To me it’s clear. Feel free to disagree about it. I’d welcome your answer too and be interested. — Ole V.V., Nov 08 '21 at 15:26
@Holger @OleV.V. My mistake. I mistook `Character.getNumericValue()` to be returning code points for numerals other than the ASCII set. However, it seems to be converting the Arabic numerals to ASCII correctly. — Sree Kumar, Nov 09 '21 at 05:16
@SreeKumar Thanks for spelling your mistake out. I think others could make the same, so I have added a paragraph explaining. — Ole V.V., Nov 09 '21 at 06:16

VGR · Answer 2 · 2021-11-08T19:39:20.433

The error message states that the problem is at index 19 in the input string.

Character 19 is the + character in your input string. This means the offset (represented by XXX in your pattern) cannot be parsed.

The problem is not the + itself. The problem is that timezone offsets, like +05:00, are never localized.

The documentation doesn’t talk about this, so I had to go to the source code of DateTimeFormatterBuilder to verify it.

Inside that class is this inner class:

    static final class OffsetIdPrinterParser implements DateTimePrinterParser {

In that class, we can find a parse method which has calls to the private parseHour, parseMinute, and parseSeconds methods.

Each of those methods delegates to a private parseDigits method. In that method, we can see that only ASCII digits are considered:

    char ch1 = parseText.charAt(pos++);
    char ch2 = parseText.charAt(pos++);
    if (ch1 < '0' || ch1 > '9' || ch2 < '0' || ch2 > '9') {
        return false;
    }

So, the answer here is that the timezone offset must consist of ASCII digits, regardless of the locale.
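This behavior can be checked with a small experiment. The sketch below rests on a few assumptions: it uses `DecimalStyle.STANDARD` with the zero digit replaced instead of the locale-based style from the question, and `OffsetDigitsDemo`, `toArabicIndic` and `errorIndex` are hypothetical names introduced only for this illustration.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.DecimalStyle;

class OffsetDigitsDemo {

    // Same pattern as in the question; only the zero digit is switched
    // to U+0660 (ARABIC-INDIC DIGIT ZERO).
    static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
                    .withDecimalStyle(DecimalStyle.STANDARD.withZeroDigit('\u0660'));

    /** Hypothetical helper: rewrites ASCII digits as Arabic-Indic digits. */
    public static String toArabicIndic(String ascii) {
        StringBuilder sb = new StringBuilder(ascii.length());
        for (char ch : ascii.toCharArray()) {
            sb.append(ch >= '0' && ch <= '9' ? (char) ('\u0660' + (ch - '0')) : ch);
        }
        return sb.toString();
    }

    /** Returns the index at which parsing failed, or -1 if the text parsed. */
    public static int errorIndex(String text) {
        try {
            OffsetDateTime.parse(text, FORMATTER);
            return -1;
        } catch (DateTimeParseException e) {
            return e.getErrorIndex();
        }
    }

    public static void main(String[] args) {
        String localPart = toArabicIndic("2021-11-08T02:21:08");
        // Offset in ASCII digits: the case the question reports as working.
        System.out.println(errorIndex(localPart + "+02:00"));
        // Offset in Arabic-Indic digits: the case the question reports as failing.
        System.out.println(errorIndex(localPart + toArabicIndic("+02:00")));
    }
}
```

If the analysis above holds, the first call returns -1 and the second returns a non-negative error index pointing at the offset, which is the position reported in the question's exception.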

Nice observation about the digits in the timezone. I think @OleV.V.'s code to translate the text into ASCII numerals will take care of it now. — Sree Kumar, Nov 09 '21 at 07:36
