Why does strLen equal 2 even though the string consists of a single character, '𐍉'?
byte[] bytesChar = {(byte)240, (byte)144, (byte)141,(byte)137};
String chars = new String(bytesChar, StandardCharsets.UTF_8);
int strLen = chars.length();
𐍉 is U+10349.
As the five-digit hexadecimal code point indicates, it lies outside the Basic Multilingual Plane (BMP), the range U+0000 to U+FFFF whose characters can each be represented in 16 bits.
Java strings are encoded using UTF-16, so this character requires two 16-bit code units (chars) to be represented in a String. Specifically, it is represented by the char values 0xD800 and 0xDF49.
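As a rough illustration of where those two values come from, the sketch below asks the JDK to encode a code point as UTF-16 and prints the resulting code units. The class name SurrogateDemo and the helper utf16Units are made up for this example; Character.toChars is the standard library call doing the actual work.

```java
public class SurrogateDemo {
    // Hypothetical helper: formats the UTF-16 code units of a code point as hex.
    static String utf16Units(int codePoint) {
        // Character.toChars returns one char for BMP code points
        // and a surrogate pair (two chars) for supplementary ones.
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x10349)); // 0xD800 0xDF49
        System.out.println(utf16Units(0x41));    // 0x0041
    }
}
```

The pair can also be derived by hand: subtract 0x10000 from the code point (giving 0x349), add the top 10 bits to 0xD800 and the low 10 bits to 0xDC00.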
For backwards compatibility reasons, String.length() returns the number of code units (i.e. char values) needed to make up the String, not the number of Unicode code points.
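If what you actually want is the number of code points, String.codePointCount is one way to get it. The sketch below reuses the bytes from the question; the class name CodePointCountDemo and its helper methods are hypothetical, chosen only for this example.

```java
import java.nio.charset.StandardCharsets;

public class CodePointCountDemo {
    // Decodes the four UTF-8 bytes from the question into a String.
    static String decode() {
        byte[] bytesChar = {(byte) 240, (byte) 144, (byte) 141, (byte) 137};
        return new String(bytesChar, StandardCharsets.UTF_8);
    }

    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String chars = decode();
        System.out.println(codeUnits(chars));                        // 2
        System.out.println(codePoints(chars));                       // 1
        System.out.println(Integer.toHexString(chars.codePointAt(0))); // 10349
    }
}
```

Note that a code point is still not necessarily what a user perceives as one character; combining sequences can span several code points.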
The reason this kind of problem doesn't show up more often is that the majority of frequently used characters are in the BMP and are therefore represented by a single code unit. The most common exceptions are emoji, many of which lie outside the BMP.