1

I am investigating a mess that has been made of our language support (it is used in our IDN functionality, if that rings a bell)...

I used an SQL GUI client to quickly see the structure of our language definitions. So, when I do `select charcodes from ourCharCodesTable where language = 'myLanguage';`, I get results for some values of 'myLanguage', e.g.:

myLanguage = "ASCII":
result = "-0123456789abcdefghijklmnopqrstuvwxyz"

myLanguage = "Russian":
result = "-0123456789абвгдежзийклмнопрстуфхцчшщъьюяѐѝ"
(BTW: can already see a language mistake here, if you are a polyglot like me!)

I thought: "OK, I can work with this! Let's write a Java program and put some logic to find mistakes..."
I need my logic to receive one char at a time from the 'result' and, according to the current table context, apply my logic to flag if it should or should not be there...

However! When I am at:
myLanguage = "Belarusian" :
One would think this language is rather similar to Russian, but the very format of the result, as coming from the database is totally different: result = "U+002D\nU+0030\nU+0030..." !

And, there's another format! myLanguage = "Chinese" :
result = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"

FWIW: charcodes column is of CLOB type.

I know U+002D is '-' and U+0030 is '0'...

My current idea is to:
1] Check whether the entire response is in 'щ' format or 'U+0449' format (whether the 'U+****'s are separated with ';', ',' or '\n', I am just going to treat them as standalone chars)
a. If it is the "easy one", just send the char on to my testing method
b. If it is the "hard one", get the hex part (0449), convert to decimal (1097) and cast to char (щ)
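
The plan above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `CharCodes` and the method `toCodePoints` are hypothetical, and it assumes that a value containing `U+` anywhere is in the `U+XXXX` format and that every other character in such a value (`;`, `,`, newlines, `#`) is a separator that can be ignored. It returns `int` code points rather than `char`s, because a 5- or 6-digit value does not fit in a single `char`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharCodes {
    // "U+" followed by 4 to 6 hex digits, e.g. U+0449 (assumed token shape)
    private static final Pattern CODE_POINT = Pattern.compile("U\\+([0-9A-Fa-f]{4,6})");

    /** Returns the code points described by one charcodes value, whichever format it uses. */
    static List<Integer> toCodePoints(String charcodes) {
        List<Integer> result = new ArrayList<>();
        if (charcodes.contains("U+")) {
            // "Hard" format: pull out every U+XXXX token and parse its hex part.
            Matcher m = CODE_POINT.matcher(charcodes);
            while (m.find()) {
                result.add(Integer.parseInt(m.group(1), 16));
            }
        } else {
            // "Easy" format: the string already holds the literal characters.
            charcodes.codePoints().forEach(cp -> result.add(cp));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toCodePoints("U+002D\nU+0030\nU+0449")); // [45, 48, 1097]
        System.out.println(toCodePoints("-0щ"));                     // [45, 48, 1097]
    }
}
```

Each returned code point can then be handed to the testing method one at a time, regardless of which format the row was stored in.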

So, again, my questions are:

  • What is this "U+043E;U+006F,U+004D" format?
  • If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?
DraxDomax

3 Answers

2

UPDATED

What is this "U+043E;U+006F,U+004D" format?

In a comment, OP provided a link to https://www.iana.org/domains/idn-tables/tables/academy_zh_1.0.txt, which has the following text:

  • This table conforms to the format specified in RFC 3743.

RFC 3743 can be found at https://www.rfc-editor.org/rfc/rfc3743


If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?

It is not a widely-used standard, so Java does not offer that natively, but it is easy to convert to regular String using regex, so you can then process the string normally.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// In all three variants, Matcher.quoteReplacement keeps a decoded '\' (U+005C)
// or '$' (U+0024) from being interpreted as replacement-string syntax.

// Java 11+
static String decodeUnicode(String input) {
    return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
            Matcher.quoteReplacement(Character.toString(Integer.parseInt(mr.group().substring(2), 16))));
}
// Java 9+
static String decodeUnicode(String input) {
    return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
            Matcher.quoteReplacement(new String(new int[] { Integer.parseInt(mr.group().substring(2), 16) }, 0, 1)));
}
// Java 1.5+
static String decodeUnicode(String input) {
    StringBuffer buf = new StringBuffer();
    Matcher m = Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input);
    while (m.find()) {
        // Drop the leading "U+" and parse the remaining hex digits
        String hexString = m.group().substring(2);
        int codePoint = Integer.parseInt(hexString, 16);
        String unicodeCharacter = new String(new int[] { codePoint }, 0, 1);
        m.appendReplacement(buf, Matcher.quoteReplacement(unicodeCharacter));
    }
    return m.appendTail(buf).toString();
}

Test

System.out.println(decodeUnicode("#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"));

Output

#
-;-;=,M,-
0;0;0
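
The decoded output still contains the `;` and `,` separators. If the columns matter, a row-by-row parse may be more convenient. The following is only a sketch under stated assumptions: the names `VariantTable`, `parse` and `decodeColumn` are hypothetical, lines starting with `#` are treated as comments, and `;` is treated as the column separator, as the syntax section of RFC 3743 is described in the comments on this answer. How `,` and spaces should be interpreted inside a column is left aside here: all code points found in a column are simply concatenated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VariantTable {
    private static final Pattern CODE_POINT = Pattern.compile("U\\+([0-9A-Fa-f]{4,6})");

    /** Decodes every U+XXXX token in one column into a string of those characters. */
    static String decodeColumn(String column) {
        StringBuilder sb = new StringBuilder();
        Matcher m = CODE_POINT.matcher(column);
        while (m.find()) {
            // appendCodePoint throws IllegalArgumentException for values above U+10FFFF
            sb.appendCodePoint(Integer.parseInt(m.group(1), 16));
        }
        return sb.toString();
    }

    /** Splits a table into rows of decoded columns, skipping blank lines and '#' comments. */
    static List<List<String>> parse(String table) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : table.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            List<String> columns = new ArrayList<>();
            // limit -1 keeps empty trailing columns instead of dropping them
            for (String column : trimmed.split(";", -1)) {
                columns.add(decodeColumn(column));
            }
            rows.add(columns);
        }
        return rows;
    }
}
```

For the test string above, `VariantTable.parse(...)` yields two rows: `[-, -, =M-]` and `[0, 0, 0]`.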
Andreas
  • U+xxxx is standard notation - https://stackoverflow.com/questions/1273693. No idea what the other stuff (commas, semicolons, etc) might mean though. – Stephen C Feb 01 '21 at 04:04
  • @StephenC The rest is whatever the `ourCharCodesTable` table defines the value of column `charcodes` to be, so that would be some proprietary specification. --- I know that `U+HHHH` is a standard notation for referring to Unicode characters using hex numbers, I've just never seen strings *encoded* ("formatted") with that notation before. Most programming languages use `\uHHHH`. JSON too. XML uses `&#xHHHH;`. – Andreas Feb 01 '21 at 04:32
  • It's not just us. It's pretty easy to find online such tables: (careful, large page) https://www.iana.org/domains/idn-tables/tables/academy_zh_1.0.txt - IANA however are not such cool people you can ask... I asked our devs who have filled out that table in the first place but their answer doesn't really inspire confidence. One guy, who might know is Patrick Mevzek, if I am lucky he might see this question! – DraxDomax Feb 01 '21 at 10:10
  • @Andreas: fair enough, I've removed my comments. – Joachim Sauer Feb 01 '21 at 12:19
  • Thanks for investing a considerable amount of time trying to help! I wish not to sound greedy or lazy but the truth is that I am aware of that RFC, yet unable to understand it. I couldn't find there anything like: "semicolons are parts of the UTF-16 encoding, commas mean the chars can be considered equivalent and is just a tidy way of saying they are both acceptable (instead of putting each in a new line)" - I realize nothing worth paying an engineer to do is so easy to understand but the RFC guys sure didn't spare the rod here! – DraxDomax Feb 01 '21 at 19:50
  • @DraxDomax In the RFC, on page 25, see section "5.2. Comments and **Explanation of Syntax**". Starts with explanation of comments and blank lines. Halfway down page 26, in a paragraph starting with "The table has three columns, separated by **semicolons**:" and the next 6 paragraphs, it explains semicolons, spaces, and commas. --- *FYI:* That "semicolons" is the *only* use of the word in the entire RFC, so it was very easy to find, all you have to do is try. – Andreas Feb 02 '21 at 02:34
  • you know, it occurred to me that I might actually search "semicolon" but I am too pessimistic these days... I didn't even "try" :( Thanks for propping me up on this. May I offer to make a contribution of $10 to a cause that's dear to you? – DraxDomax Feb 02 '21 at 03:58
1

U+0000 is a representation of a Unicode codepoint and the format is defined in Appendix A of the Unicode Standard. The numbers are simply the hex-encoded number of the represented codepoint. For historical reasons they are always left-padded to at least 4 digits with 0, but can be up to 6 digits long.

It is not primarily meant as a machine-readable encoding, but rather as a human-readable representation of Unicode codepoints for use in running text (i.e. paragraphs such as this one). Note especially that this format does not have a way to distinguish a four-digit number followed by some digits from a 5- or 6-digit number. So U+123456 could be interpreted in 3 different ways: U+1234 followed by the text 56, U+12345 followed by the text 6 or U+123456. This makes it unsuited for automatic replacement and use as a general-purpose encoding.
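
To make the ambiguity concrete, here is a small sketch (the class `AmbiguityDemo` and the method `firstToken` are hypothetical names) showing what a greedy regular expression does with such input. It assumes the commonly used pattern of `U+` followed by 4 to 6 uppercase hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AmbiguityDemo {
    private static final Pattern CODE_POINT = Pattern.compile("U\\+[0-9A-F]{4,6}");

    /** Returns the text of the first U+ token the greedy pattern finds, or null if none. */
    static String firstToken(String text) {
        Matcher m = CODE_POINT.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A writer may have meant U+0041 ("A") followed by the digits "56",
        // but the greedy quantifier takes all six hex digits.
        System.out.println(firstToken("U+004156"));  // U+004156
        // With a non-hex character after the digits there is no ambiguity.
        System.out.println(firstToken("U+0041;56")); // U+0041
    }
}
```

Input where every codepoint is followed by a separator such as `;`, `,` or a newline does not run into this problem.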

As such there is no built-in functionality to parse this into its equivalent String or similar in Java.

The following code can be used to parse a single Unicode codepoint reference into the appropriate codepoint in a String:

  public static String codePointToString(String input) {
    if (!input.startsWith("U+")) {
      throw new IllegalArgumentException("Malformed input, doesn't start with U+");
    }
    int codepoint = Integer.parseInt(input.substring(2), 16);
    if (codepoint < 0 || codepoint > Character.MAX_CODE_POINT) {
      throw new IllegalArgumentException("Malformed input, codepoint value out of valid range: " + codepoint);
    }
    return Character.toString(codepoint);
  }

(Before Java 11 the return line needs to use new String(new int[] { codepoint }, 0, 1) instead).

And if you want to replace all Unicode codepoints represented in a text by their actual text (which might render it unreadable in some cases) you can use this (together with the method above):

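  // Needs java.util.regex.Matcher and java.util.regex.Pattern.
  // Only hex digits may follow "U+"; any other letter would make Integer.parseInt throw.
  private static final Pattern PATTERN = Pattern.compile("U\\+[0-9A-Fa-f]{4,6}");
  
  public static String decodeCodePoints(String input) {
    // quoteReplacement keeps a decoded '\' or '$' from being read as replacement syntax.
    return PATTERN
        .matcher(input)
        .replaceAll(result -> Matcher.quoteReplacement(codePointToString(result.group())));
  }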
Joachim Sauer
  • In Java 11, [`Character.toString(int codePoint)`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Character.html#toString(int)) would be better to use than the [`String(int[] codePoints, int offset, int count)`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#%3Cinit%3E(int%5B%5D,int,int)) constructor. – Andreas Feb 01 '21 at 11:49
  • @Andreas: good point, I've updated the code. – Joachim Sauer Feb 01 '21 at 11:56
-2

Actually, I wrote an open-source library called MgntUtils that has a utility that can help you. The codes that you see are Unicode sequences where each U+XXXX represents a character. The utility in the library can convert any string in any language (including special characters) into Unicode sequences and vice versa. Here is a sample of how it works:

String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

The output of this code is:

\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World

The library can be found at Maven Central or on GitHub. It comes as a Maven artifact with sources and Javadoc.

Here is javadoc for the class StringUnicodeEncoderDecoder

Michael Gantman
  • `\u0048` is quite a different format from U+0048. – Joachim Sauer Feb 01 '21 at 12:18
  • Yes, but they do mean the same symbol. And for testing or restoring data, you can copy-paste your codes U+XXXX and just replace "U+" to "\u" and then you can restore the data. This utility helped me many times which is why I published it as part of Open Source library. I don't believe this answer deserved downvoting as it gives the user an option to diagnose the problem and restore the data. But, I am thankful to @Joachim Sauer that he gave his reasoning for downvoting – Michael Gantman Feb 01 '21 at 12:21
  • They are also not quite compatible: The `\uXXXX` notation is always 4 characters and represents non-BMP codepoints as two surrogates (i.e. `\uXXXX\uXXXX` instead of `U+XXXXXX`). – Joachim Sauer Feb 01 '21 at 12:26
  • *"they do mean the same symbol"* Irrelevant. OP needs to process `U+HHHH`, not `\uHHHH`, so if the method doesn't do that, it's useless. --- *"you can copy-paste"* OP needs code to do this, not something that must be manually edited. --- Down-voted because the answer is *not useful*, given that it cannot help process the input that OP needs to process. – Andreas Feb 01 '21 at 12:30
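
To illustrate the surrogate-pair point raised in the comments above, here is a small sketch (the class `SurrogateDemo` and the method `toJavaEscapes` are hypothetical names) that renders a codepoint as Java-style escapes, one per UTF-16 code unit. It shows why simply swapping the `U+` prefix for a backslash-u prefix cannot work for codepoints with more than four hex digits.

```java
final class SurrogateDemo {
    /** Renders a codepoint as Java-style escapes, one escape per UTF-16 code unit. */
    static String toJavaEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars returns one char for BMP codepoints and a surrogate pair otherwise.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJavaEscapes(0x0449));  // a BMP codepoint: one escape
        System.out.println(toJavaEscapes(0x1F600)); // a non-BMP codepoint: two escapes
    }
}
```

U+0449 comes out as the single escape `\u0449`, while U+1F600 comes out as the surrogate pair `\uD83D\uDE00`.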