UTF-8 string length returns 2 even though the string consists of a single char '𐍉'

Question

Why does `strLen` equal 2 even though the string consists of a single char '𐍉'?

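    // requires: import java.nio.charset.StandardCharsets;
    byte[] bytesChar = {(byte) 240, (byte) 144, (byte) 141, (byte) 137}; // UTF-8 bytes F0 90 8D 89
    String chars = new String(bytesChar, StandardCharsets.UTF_8);
    int strLen = chars.length(); // evaluates to 2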

`Char` is already a `String`, it makes no sense to write `""+Char`. — I keep seeing this pattern on here recently, is this taught somewhere? — Konrad Rudolph, Dec 22 '22 at 11:05
@Konrad Rudolph Thank you! Indeed! It's just my faulty self-education — newman, Dec 22 '22 at 11:07
@NewMan No worries, and it’s a mistake that can honestly happen to everyone. I was just confused because I keep seeing this increasingly frequently — Konrad Rudolph, Dec 22 '22 at 11:08
I've voted to reopen this, since the suggested duplicate https://stackoverflow.com/questions/15947992/java-unicode-string-length actually talks about **another** reason why the number of visible characters doesn't match the result of `length`: The dupe is about multiple Unicode codepoints combining to make a single "visible character". It's likely that there **is** a correct dupe target on Stack Overflow, but this is not it. — Joachim Sauer, Dec 22 '22 at 12:02
I have examined the above-mentioned reference. Now I have another question. I have saved the char 'கு' in Notepad (UTF-8 without BOM) and now I see that the saved file has a length of 6 bytes. Why is that? As far as I know, UTF-8 symbols take at most 4 bytes... — newman, Dec 22 '22 at 12:28
@NewMan: that issue is exactly the one that's explained in the question I linked to above: Some "characters" are actually multiple codepoints that combine to produce a single visible character. See [here](https://www.fontspace.com/unicode/analyzer#e=4K6V4K-B) for how this character is constructed. It's two Unicode codepoints, each of which requires 3 bytes in UTF-8. In other words: please make sure to actually read the *answers* on that other question, they explain this in quite some detail. — Joachim Sauer, Dec 22 '22 at 12:55
[This question explains the basic principles of code points, code units, graphemes, glyphs, ... quite well](https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme). — Joachim Sauer, Dec 22 '22 at 12:59

score 3 · Accepted Answer · answered Dec 22 '22 at 11:18

𐍉 is U+10349.

As the 5-digit Unicode number indicates, it's outside of the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), which is the set of Unicode characters that can be represented in 16 bits.

Java strings are encoded using UTF-16, so this character requires two 16-bit code units (`char`s) to be represented in a `String`. Specifically, it will be represented by the surrogate pair of `char` values 0xD800 and 0xDF49.
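As a rough sketch of where those two values come from, the standard UTF-16 surrogate arithmetic can be reproduced by hand and compared with what `Character.toChars` returns. The class and helper names below are made up for illustration only.

```java
public class SurrogateDemo {
    // Hypothetical helper: high (leading) surrogate of a supplementary code point.
    static char highSurrogate(int codePoint) {
        // Subtract 0x10000, keep the top 10 of the remaining 20 bits.
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Hypothetical helper: low (trailing) surrogate of a supplementary code point.
    static char lowSurrogate(int codePoint) {
        // Subtract 0x10000, keep the bottom 10 of the remaining 20 bits.
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int codePoint = 0x10349;
        char[] fromJdk = Character.toChars(codePoint);
        // Both lines print D800 DF49.
        System.out.printf("manual: %04X %04X%n",
                (int) highSurrogate(codePoint), (int) lowSurrogate(codePoint));
        System.out.printf("JDK:    %04X %04X%n", (int) fromJdk[0], (int) fromJdk[1]);
    }
}
```

The helpers are only meaningful for code points above U+FFFF; BMP code points need no surrogate pair at all.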

For backwards compatibility reasons `String.length()` returns the number of code units (i.e. `char` values) needed to make up the `String` and not the number of Unicode codepoints.
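A minimal sketch of the difference, reusing the bytes from the question: `length()` counts UTF-16 code units, while `codePointCount` counts Unicode codepoints. The class name and the `decode` helper are only there for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LengthDemo {
    // Illustrative helper: decodes the four UTF-8 bytes from the question.
    static String decode() {
        byte[] bytes = {(byte) 240, (byte) 144, (byte) 141, (byte) 137};
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode();
        System.out.println(s.length());                                // UTF-16 code units: 2
        System.out.println(s.codePointCount(0, s.length()));           // Unicode codepoints: 1
        System.out.println(Integer.toHexString(s.codePointAt(0)));     // 10349
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // UTF-8 bytes: 4
    }
}
```

Note that `codePointCount` still counts codepoints, not visible characters, so a combined sequence such as the 'கு' from the comments would be reported as 2.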

The reason this kind of problem doesn't show up more often is that the majority of frequently used characters are in the BMP and are therefore represented by one code unit. The most common exceptions are some emojis.
