
Java's internal encoding for chars is UTF-16, right? Since even ASCII characters use a 2-byte encoding, I expect:

     String h="hello"; 
     System.out.println(h.codePointCount(0,h.length())); 
     System.out.println(h.length()); 

to print 10 and 5, but in fact it prints 5, 5.

Where did I get wrong?

Troskyvs
  • The answer to this question is here: https://stackoverflow.com/questions/5078314/isnt-the-size-of-character-in-java-2-bytes – Centos Nov 20 '18 at 12:16
  • `codePointCount` basically is a more exact version of `length` that works correctly for surrogate pairs. For ASCII characters (more generally BMP characters) there is no difference. – Henry Nov 20 '18 at 12:41

1 Answer


Try a string containing a character outside the Basic Multilingual Plane, e.g. U+1D11E (MUSICAL SYMBOL G CLEF):

    String h = "hell\uD834\uDD1E";
    System.out.println(h.codePointCount(0, h.length())); // 5
    System.out.println(h.length());                      // 6

It prints 5, 6.

U+1D11E is represented by two code units (a surrogate pair); each of 'h', 'e', 'l', 'l' by one.

And about UTF-16: "The encoding is variable-length, as code points are encoded with one or two 16-bit code units..."
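The variable-length behavior is easy to observe directly. A minimal sketch, using U+1F600 (an arbitrary supplementary character, not one from the question) alongside an ASCII character:

```java
public class CodeUnitsDemo {
    public static void main(String[] args) {
        // "a" is one UTF-16 code unit; U+1F600 needs two (a surrogate pair)
        String s = "a\uD83D\uDE00";
        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        // Iterating by code point avoids splitting surrogate pairs
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

For BMP-only strings like `"hello"`, every code point is a single code unit, so the two counts agree.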

  • In case it's not clear, the question arises from confusing Unicode _codepoints_ with UTF-16 _code units_. Codepoints are not encoded. – Tom Blodget Nov 20 '18 at 14:43
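The codepoint/code-unit distinction in that comment can be sketched like this: a code point is an abstract number, and UTF-16 is one way of encoding it into 16-bit units (U+1D11E here is chosen arbitrarily as an example):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        int cp = 0x1D11E; // one code point: MUSICAL SYMBOL G CLEF
        // UTF-16 encodes it as two 16-bit code units (a surrogate pair)
        char[] units = Character.toChars(cp);
        System.out.println(units.length); // 2
        String s = new String(units);
        // The same single code point occupies 4 bytes in both encodings
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```

So `length()` counts UTF-16 code units, while `codePointCount` counts the abstract code points regardless of how many units each one needs.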