
If I look in UAX#15 Section 9, there is sample code to check for normalization. That code uses the NFC_QC property and checks CCC ordering, as expected. It looks great, except that this line puzzles me: if (Character.isSupplementaryCodePoint(ch)) ++i;. It seems to be saying that if a character is supplementary (i.e. its code point is >= 0x10000), then I can just assume the next character passes the quick check, without bothering to check the NFC_QC property or CCC ordering on it.

Theoretically, I could have, say, a starter code point, followed by a supplementary code point with CCC > 0, followed by a third code point whose CCC is nonzero but lower than that of the second one (or whose NFC_QC is No), and the string would STILL pass the NFC quick check, even though it would seem not to be in NFC form. There are a bunch of supplementary code points with CCC of 7, 9, 216, 220, or 230, so there seem to be plenty of ways to hit this case; one such sequence is sketched below. I guess this can only work if we can assume that, in all future versions of Unicode, every supplementary character with CCC > 0 will also have NFC_QC == No.
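For concreteness, here is one such sequence I have in mind (the code points and CCC values are my own picks from the UCD as I read it, not an example taken from UAX#15):

// A starter, then a supplementary combining mark, then a mark with a
// lower (but nonzero) combining class -- not in canonical order:
//   U+0041  LATIN CAPITAL LETTER A         CCC = 0   (starter)
//   U+1D165 MUSICAL SYMBOL COMBINING STEM  CCC = 216 (supplementary)
//   U+0334  COMBINING TILDE OVERLAY        CCC = 1
String s = "A\uD834\uDD65\u0334";  // U+1D165 written as a surrogate pair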

Is this sample code correct? If so, why is this supplementary check valid? Are there cases that would produce incorrect results if that check were removed?

Here is the code snippet copied directly from that link.

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}
Andrew
Dave

1 Answer


The sample code is correct,¹ but the part that concerns you has little to do with Unicode normalization: no characters in the string are actually skipped; it’s just that Java makes iterating over a string’s characters somewhat awkward.
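To see why nothing is skipped, here is a minimal trace (my own, not part of the UAX#15 sample) of the loop’s index handling over the string from the question:

String s = "A\uD834\uDD65\u0334";  // the sequence from the question
for (int i = 0; i < s.length(); ++i) {
    int ch = s.codePointAt(i);  // decodes a full code point, surrogate pair or not
    System.out.printf("i=%d -> U+%04X%n", i, ch);
    if (Character.isSupplementaryCodePoint(ch)) ++i;  // step over the trailing surrogate
}
// Prints:  i=0 -> U+0041
//          i=1 -> U+1D165
//          i=3 -> U+0334
// All three code points are visited, including U+1D165 with its CCC of 216;
// only the trailing surrogate at i=2 is stepped over. The CCC ordering check
// therefore sees 216 followed by 1, and quickCheck returns NO, as it should.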

The extra increment is a workaround for a historical wart in Java (one it happens to share with JavaScript and Windows, two other early adopters of Unicode): a Java String is an array of Java chars, but a Java char is not a Unicode (abstract) character or a (concrete, numeric) code point; it is a 16-bit UTF-16 code unit. This means that every character with code point C < 1 0000h occupies one position in a Java String, containing C itself, but every character with code point C ≥ 1 0000h occupies two, as specified by UTF-16: the high or leading surrogate D800h + (C − 1 0000h) div 400h, followed by the low or trailing surrogate DC00h + (C − 1 0000h) mod 400h. (No Unicode characters are or ever will be assigned code points in the range [D800h, DFFFh], so the two cases are unambiguously distinguishable.)
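As a sanity check, the arithmetic above can be spelled out in Java next to the library calls that compute the same pair (Character.highSurrogate and Character.lowSurrogate; U+1D165 is just an arbitrary supplementary code point here):

int c = 0x1D165;  // any code point >= 1 0000h
char high = (char) (0xD800 + ((c - 0x10000) / 0x400));  // leading surrogate, D834h
char low  = (char) (0xDC00 + ((c - 0x10000) % 0x400));  // trailing surrogate, DD65h
// The standard library performs the same computation:
assert high == Character.highSurrogate(c);
assert low  == Character.lowSurrogate(c);
String s = String.valueOf(new char[] { high, low });
System.out.println(s.length());                       // 2 -- two UTF-16 code units
System.out.println(s.codePointCount(0, s.length()));  // 1 -- one code point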

Because Unicode normalization operates on a sequence of Unicode characters and cares little for the particulars of UTF-16, the sample code calls String.codePointAt(i) to decode the code point that occupies either position i alone or the two positions i and i+1 in the provided string, processes it, and uses Character.isSupplementaryCodePoint to decide whether it should advance one position or two. The way the loop is written treats the “supplementary” two-unit case like an unwanted stepchild, but that’s the accepted Java way of treating it.
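Incidentally, if the asymmetry bothers you, the same iteration can be written without the special case; these equivalents are sketches of the iteration pattern only, not part of the UAX#15 sample:

// Character.charCount returns 1 or 2, replacing the explicit supplementary test:
for (int i = 0; i < source.length(); ) {
    int ch = source.codePointAt(i);
    // ... process ch exactly as in quickCheck ...
    i += Character.charCount(ch);  // 1 for BMP, 2 for supplementary
}

// Or, since Java 8, as a stream of code points:
source.codePoints().forEach(ch -> {
    // ... process ch ...
});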

¹ Well, correct up to a small spelling error: codepointAt should be codePointAt.

Alex Shpilkin