I am having trouble identifying surrogate characters in strings like devā́n. I have read the related questions on this topic here on SO, but something is still going wrong.
As you can see, the "natural" length (I just made up that expression) of this string is 5, but "devā́n".length() gives me 6. That is fine, because ā́ consists of two chars internally (I assume it does not fit into a single UTF-16 code unit). But I would like to get the length of the string as you would read it or as it is printed, so 5 in this case.
I tried to identify the odd chars with the tricks found here and here, but they don't work and I always get 6. Just have a look at this:
// string that (I think) contains a surrogate pair
String s = "devā́n";
// prints the string properly
System.out.println("String: " + s);
// prints "Length: 6"
System.out.println("Length: " + s.length());
// prints "Codepoints: 6"
System.out.println("Codepoints: " + s.codePointCount(0, s.length()));
// prints "false"
System.out.println(
    Character.isSurrogate(s.charAt(3)));
// prints "false"
System.out.println(
    Character.isSurrogate(s.charAt(4)));
System.out.println("\n");
// prints each char with its code point: six lines, six code points
for (int i = 0; i < s.length(); i++) {
    System.out.println(s.charAt(i) + ": " + s.codePointAt(i));
}
Is it possible that ā́ is not a valid pair of surrogate chars at all? How can I identify such a compound char and count it as only one?
BTW, the output of the code above is:
String: devā́n
Length: 6
Codepoints: 6
false
false
d: 100
e: 101
v: 118
ā: 257
́: 769
n: 110
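
For reference, here is a minimal sketch of the kind of count I am after. It assumes that java.text.BreakIterator.getCharacterInstance() treats a base letter plus a following combining mark as a single unit, which is how I read its documentation. The class and method names (VisibleLength, visibleLength) are made up for illustration, and the string is written with \u escapes on the assumption that it really is the two code points 257 and 769 shown in the output above. For contrast, it also checks a string that I believe is a genuine surrogate pair (U+1D11E, the musical G clef).

```java
import java.text.BreakIterator;

public class VisibleLength {

    // Counts user-perceived characters by stepping over the character
    // boundaries reported by BreakIterator (hypothetical helper).
    public static int visibleLength(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        // Each successful next() moves past one user-perceived character.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Same string as above: 'ā' (U+0101) followed by U+0301.
        String s = "dev\u0101\u0301n";
        System.out.println("char length:    " + s.length());
        System.out.println("visible length: " + visibleLength(s));

        // Is the fifth char a non-spacing (combining) mark rather than a surrogate?
        boolean combining =
                Character.getType(s.codePointAt(4)) == Character.NON_SPACING_MARK;
        System.out.println("combining mark at index 4: " + combining);

        // A code point outside the BMP, stored as two surrogate chars.
        String clef = "\uD834\uDD1E";
        System.out.println("clef is surrogate: " + Character.isSurrogate(clef.charAt(0)));
        System.out.println("clef code points:  " + clef.codePointCount(0, clef.length()));
        System.out.println("clef visible:      " + visibleLength(clef));
    }
}
```

If this assumption holds, the visible length of the string should come out as 5 while length() stays at 6, and the clef should count as one code point even though it takes two chars. That would suggest my string has no surrogates at all and that the two cases need different handling.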