
I ran into a tricky case and couldn't find an explanation of why this happens.

The main problem is the length of the string: does it contain one character or two?

Code:

public class App {
    public static void main(String[] args) throws Exception {
        char ch0 = 55378;
        char ch1 = 56816;
        String str = new String(new char[]{ch0, ch1});
        System.out.println(str);
        System.out.println(str.length());
        System.out.println(str.codePointCount(0, 2));
        System.out.println(str.charAt(0));
        System.out.println(str.charAt(1));
    }
}

Output:

?
2
1
?
?

Any suggestions?

catch23

1 Answer


Does it contain one character or two?

It contains one Unicode character, which is made up of 2 UTF-16 code units. Every char in Java is a UTF-16 code unit... it may not be a whole character. Each character has a single code point - Unicode provides a coded character set that maps each character to an integer representing that character (the code point).
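
To illustrate with your two values (just a sketch, the class name is made up for the example): 55378 is 0xD852, a high (lead) surrogate, and 56816 is 0xDDF0, a low (trail) surrogate. Together they encode the single supplementary code point U+249F0:

public class SurrogateDemo {
    public static void main(String[] args) {
        char high = 55378; // 0xD852 - a high (lead) surrogate
        char low  = 56816; // 0xDDF0 - a low (trail) surrogate

        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true

        // The pair decodes to one supplementary code point, U+249F0.
        int codePoint = Character.toCodePoint(high, low);
        System.out.println(Integer.toHexString(codePoint));  // 249f0

        // A supplementary code point takes two chars (code units) in UTF-16.
        System.out.println(Character.charCount(codePoint));  // 2
    }
}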

length() returns the number of code units, whereas codePointCount returns the number of code points.
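
If you want to walk the string character by character (i.e. code point by code point) rather than char by char, step through it with codePointAt and advance by Character.charCount - a rough sketch, again with an illustrative class name:

public class CodePointIteration {
    public static void main(String[] args) {
        // Same string as in the question: one code point, two chars.
        String str = new String(new char[]{55378, 56816});

        // Advance by the number of chars each code point occupies.
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            System.out.printf("U+%X%n", cp);   // U+249F0
            i += Character.charCount(cp);
        }

        // On Java 8+ you can also stream the code points directly.
        str.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
    }
}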

You may want to look at my article about encodings in .NET - the terminology all translates fine (as it's standard terminology), so just ignore the .NET-specific parts.

Jon Skeet