Cannot identify surrogate characters in Java string

Question

I am having trouble identifying surrogate characters in strings like devā́n. I read the relevant questions on the topic here on SO, but something is still wrong with this...
As you can see, the "natural" length (I just made up that expression) of this string is 5, but "devā́n".length() gives me 6.
That is fine, because ā́ consists of two characters internally (it's not within the UTF-16 code range). But I would like to get the length of the string as you'd read it or as it's printed, so 5 in this case.

I tried identifying the weirdo chars with the following tricks found here and here, but it doesn't work and I'm always getting 6. Just have a look at this:

//string containing surrogate pair
String s = "devā́n";

//prints the string properly
System.out.println("String: " + s);

//prints "Length: 6"
System.out.println("Length: " + s.length());

//prints "Codepoints: 6"
System.out.println("Codepoints: " + s.codePointCount(0, s.length()));

//false
System.out.println(
        Character.isSurrogate(s.charAt(3)));

//false
System.out.println(
        Character.isSurrogate(s.charAt(4)));

//six code points
System.out.println("\n");
for (int i = 0; i < s.length(); i++) {
    System.out.println(s.charAt(i) + ": " + s.codePointAt(i));
}

Is it maybe possible that ā́ is not a valid pair of surrogate chars? How can I identify such a compound char and count it as only one?

BTW, the output of the above code is:

String: devā́n
Length: 6
Codepoints: 6
false
false


d: 100
e: 101
v: 118
ā: 257
́: 769
n: 110

Stephen C · Accepted Answer · 2023-08-26T03:18:55.497

First of all, the reason that 769 (U+0301) does not test as a surrogate is that it is NOT a surrogate code unit at all.

Surrogates are not characters or (valid) Unicode code points. Rather, they are 16-bit code units that are used in the UTF-16 representation of a Unicode code point that is outside of plane 0; i.e. Unicode code points U+10000 and beyond. Surrogates are code units in the range U+D800 through U+DFFF. A pair of surrogates represents a single code point.
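
To make the distinction concrete, here is a minimal sketch that contrasts a real surrogate pair with the combining accent from the question. The class name `SurrogateDemo`, the helper `isSurrogateAt`, and the choice of U+1D11E as the sample character are illustrative assumptions, not part of the original code.

```java
class SurrogateDemo {
    // Returns true if the UTF-16 code unit at the given index is a surrogate.
    static boolean isSurrogateAt(String s, int index) {
        return Character.isSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside plane 0,
        // so UTF-16 encodes it as a pair of surrogate code units.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // 2 code units
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
        System.out.println(isSurrogateAt(clef, 0));                // true (high surrogate)
        System.out.println(isSurrogateAt(clef, 1));                // true (low surrogate)

        // U+0301 (COMBINING ACUTE ACCENT) is an ordinary plane-0 code point.
        String acute = "\u0301";
        System.out.println(isSurrogateAt(acute, 0));               // false
    }
}
```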

(For more information, read Universal Character Set characters. Strictly speaking, Unicode is an extension of UCS, but in this context that distinction doesn't matter.)

So what you are really trying to do here is to figure out how many "ordinary" characters there are in a UTF-16 string. This is done in two steps:

1. First, normalize the string to NFC form (see Normalizing Text) using the Normalizer API.
2. Then use the String API to find the number of code points in the string; e.g. use String.codePointCount (javadoc).
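
The two steps above could be sketched roughly as follows. The class name `NfcCount`, the method `nfcCodePointCount`, and the sample string are made up for illustration; the sample was picked because its accent does compose under NFC.

```java
import java.text.Normalizer;

class NfcCount {
    // Step 1: normalize to NFC. Step 2: count code points, not UTF-16 code units.
    static int nfcCodePointCount(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 composes to the single code point U+00E9.
        String cafe = "caf" + "e\u0301";
        System.out.println(cafe.length());           // 5
        System.out.println(nfcCodePointCount(cafe)); // 4
    }
}
```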

In this case, this still fails. The reason is that the code point sequence

ā: 257
́: 769

actually represents an "a" character with two diacritical marks (a macron and an acute accent). This cannot be represented as a single Unicode code point, so its NFC form is still two code points.
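
A quick way to check this is to normalize just that two-code-point sequence and count again. This is only a sketch; the class name `NfcNoComposite` and its helper methods are hypothetical.

```java
import java.text.Normalizer;

class NfcNoComposite {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // U+0101 (a with macron) followed by U+0301 (combining acute accent).
        String s = "\u0101\u0301";

        // NFC cannot merge the pair, so two code points remain.
        String composed = nfc(s);
        System.out.println(composed.codePointCount(0, composed.length()));     // 2

        // NFD splits it into the base letter plus two combining marks.
        String decomposed = nfd(s);
        System.out.println(decomposed.codePointCount(0, decomposed.length())); // 3
    }
}
```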

What confuses this even further is that some renderers display the "acute" accent over the following character, even though a combining mark belongs to the preceding base character. So it can look like you have an "n acute" in your example.

It is going to be very difficult to deal with pathological examples like this where base characters have multiple diacriticals that might render strangely. Maybe you need to translate to NFD and then count the code points that are not diacriticals.
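
The NFD suggestion could look roughly like this. It is a sketch under stated assumptions: `BaseCharCount` and `countBaseCharacters` are hypothetical names, and "diacritical" is taken to mean any code point in the Unicode mark categories (Mn, Mc, Me).

```java
import java.text.Normalizer;

class BaseCharCount {
    // Decomposes to NFD, then counts every code point that is not a mark.
    static int countBaseCharacters(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        int count = 0;
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!isMark) {
                count++;
            }
            // Advance by one or two code units, depending on the code point.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // The string from the question: d, e, v, U+0101, U+0301, n.
        String s = "dev\u0101\u0301n";
        System.out.println(s.length());             // 6
        System.out.println(countBaseCharacters(s)); // 5
    }
}
```

This only approximates what a reader perceives as one character. Sequences such as emoji joined with zero-width joiners would still be counted as several characters, and `java.text.BreakIterator.getCharacterInstance()` may be worth trying for those cases.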

Thanks, that sounds logical. But `s = Normalizer.normalize(s, Form.NFC);` and `System.out.println(s.codePointCount(0, s.length()));` still gives me `6`. What am I doing wrong? — bkis, Jul 25 '17 at 10:14
Thank you! I tested this with `s = Normalizer.normalize(s, Form.NFD);` and `s = s.replaceAll("\\W","");` and it indeed has a length of 5 now. — bkis, Jul 25 '17 at 11:08
