UTF-8 encoded string's byte count isn't as expected

Question

I don't understand this: why does the following code print 12 and not 11, although "hello world" has only 11 characters?

byte[] byteArray = Charset.forName("UTF-8").encode("hello world").array();
System.out.println(byteArray.length);

When you debug this, what bytes are in the array? What characters in UTF-8 do they map to? — David, Sep 21 '16 at 17:42
There is "End of String" special character after it. I guess. — antonio081014, Sep 21 '16 at 17:43

score 7 · Answer 1 · answered Sep 21 '16 at 17:52

The array method of ByteBuffer returns the array backing the buffer, but not all bytes are significant. Only the bytes up to limit are used. The following returns 11 as expected:

int limit = Charset.forName("UTF-8").encode("hello world").limit();
System.out.println(limit);
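If the goal is an actual byte[] of exactly the encoded length, one option is to copy only the significant bytes out of the buffer. This is a minimal sketch, not the only way to do it; the class name EncodedBytes and the helper utf8Bytes are made up for illustration, and it uses StandardCharsets.UTF_8 instead of Charset.forName("UTF-8").

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EncodedBytes {
    // Hypothetical helper: copies only the bytes between position and limit.
    static byte[] utf8Bytes(String s) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(s);
        byte[] bytes = new byte[buffer.remaining()]; // remaining() = limit - position
        buffer.get(bytes);                           // bulk copy of the significant bytes
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode("hello world");
        // Size of the backing array (the buffer's capacity); it can be larger than the content.
        System.out.println(buffer.array().length);
        // Number of encoded bytes actually written.
        System.out.println(buffer.remaining());              // 11
        System.out.println(utf8Bytes("hello world").length); // 11
    }
}
```

The first println reports the capacity of the backing array, which is where the 12 in the question comes from; the other two report only the bytes that belong to the encoded string.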

score 3 · Answer 2 · answered Sep 21 '16 at 17:54

Easy to see if you debug the array:

b=68, char=h
b=65, char=e
b=6C, char=l
b=6C, char=l
b=6F, char=o
b=20, char= 
b=77, char=w
b=6F, char=o
b=72, char=r
b=6C, char=l
b=64, char=d
b=0, char=

So the last byte in the array is 0 (\u0000), which is not part of "hello world".

answered by rmuller

score 1 · Answer 3 · edited May 23 '17 at 10:27

I'm not sure what you are trying to accomplish, but to get the byte array of a string, why not just use:

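String s = "hello world";
// java.nio.charset.StandardCharsets avoids the checked UnsupportedEncodingException
byte[] b = s.getBytes(StandardCharsets.UTF_8);

// Holds here only because every character is ASCII (one byte each in UTF-8)
assertEquals(s.length(), b.length);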

More information in this answer:

How to convert Strings to and from UTF8 byte arrays in Java
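Note that comparing s.length() with the byte count only works for pure ASCII text, because UTF-8 uses more than one byte for other characters. The sketch below illustrates this with a second, accented string that is not from the question; the class name LengthVsBytes and the helper utf8Length are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class LengthVsBytes {
    // Hypothetical helper: number of bytes in the UTF-8 encoding of s.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "hello world";
        // Same text with e-acute (U+00E9) and o-umlaut (U+00F6), written as escapes
        // so the result does not depend on the source file encoding.
        String accented = "h\u00e9llo w\u00f6rld";

        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");       // 11 chars, 11 bytes
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " bytes"); // 11 chars, 13 bytes
    }
}
```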

answered Sep 21 '16 at 17:52 by Francisco C.

score 0 · Answer 4 · answered Sep 21 '16 at 17:57

Using this program, you can figure out what bytes the byte array contains:

byte[] byteArray = Charset.forName("UTF-8").encode("hello world").array();
for(int i = 0; i < byteArray.length; i++) {
    System.out.println(byteArray[i]+" - "+((char)byteArray[i]));
}

The bytes are (decimal):

104 101 108 108  111 32 119 111  114 108 100 0

The first 11 bytes are the UTF-8 encoding of hello world, as expected. The last byte is 0: it is unused space in the backing array, past the buffer's limit, and not part of the encoded string.

To deal with this, just use the .limit() method of ByteBuffer as mentioned above.
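As a rough sketch of that approach, the loop below reads from the ByteBuffer itself rather than from its backing array, so it stops at the limit and never reaches the trailing 0. The class name DumpEncoded and the helper hexDump are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DumpEncoded {
    // Hypothetical helper: formats only the bytes between position and limit as hex.
    static String hexDump(ByteBuffer buffer) {
        StringBuilder sb = new StringBuilder();
        // duplicate() shares the content but has its own position,
        // so the caller's buffer is left untouched.
        ByteBuffer view = buffer.duplicate();
        while (view.hasRemaining()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", view.get() & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ByteBuffer encoded = StandardCharsets.UTF_8.encode("hello world");
        System.out.println(hexDump(encoded)); // 68 65 6C 6C 6F 20 77 6F 72 6C 64
    }
}
```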
