Unicode character length in bytes - always the same?

Question

I defined a Unicode character as a byte array:

private static final byte[] UNICODE_NEXT_LINE = Charsets.UTF_8.encode("\u0085").array();

At the moment the byte array length is 3. Is it safe to assume the length of the array is always 3 across platforms?

Thank you

score 3 · Answer 1 · answered Dec 04 '14 at 02:19

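It's safe to assume that that particular character will always encode to the same number of bytes in UTF-8, regardless of platform. Note that for U+0085 that number is two (0xC2 0x85), not three: array() returns the ByteBuffer's whole backing array, which can be larger than the encoded content, so the array length itself isn't something to rely on.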

But Unicode characters in UTF-8 can be one, two, three or even four bytes long, so no, you can't assume that every character comes out the same length when you convert it to UTF-8.
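
If you want to check the encoded length yourself, a minimal sketch along these lines should do it. The names `Utf8Lengths`, `utf8Length` and `encodedBytes` are just illustrative, and I'm assuming the JDK's `StandardCharsets` instead of Guava's `Charsets`. It measures the encoded bytes directly rather than the `array()` of the returned `ByteBuffer`, since the backing array can be larger than the content:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of any library.
final class Utf8Lengths {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Copies only the encoded bytes out of the buffer,
    // ignoring any spare capacity in its backing array.
    static byte[] encodedBytes(String s) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode(s);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1 (ASCII)
        System.out.println(utf8Length("\u0085"));       // 2 (U+0085, next line)
        System.out.println(utf8Length("\u20AC"));       // 3 (U+20AC, euro sign)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (U+1F600, outside the BMP)
        System.out.println(encodedBytes("\u0085").length); // 2
    }
}
```

If you need the bytes as a constant, `"\u0085".getBytes(StandardCharsets.UTF_8)` is probably the simplest option, because it returns an array that is exactly as long as the encoded content.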

score 1 · Accepted Answer · edited May 23 '17 at 12:21

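That particular character will always encode to the same bytes in UTF-8 (two of them, 0xC2 0x85; the 3 you're seeing is the size of the buffer's backing array, which can include unused capacity), but other characters will be different. In UTF-8, Unicode characters are anywhere from 1 to 4 bytes long. The 8 in 'UTF-8' just means that it uses 8-bit code units.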

The Wikipedia page on UTF-8 provides a pretty good overview of how that works. Basically, the first bits of the first byte tell you how many bytes long that character will be. For instance, if the first bit of the first byte is a 0, as in 01111111, then this character is only one byte long (in UTF-8, these are the ASCII characters). If the first bits are 110, as in 11011111, then this character will be two bytes long. The chart on the Wikipedia page provides a good illustration of this.
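
As a rough sketch of that rule in Java (the class and method names here are made up for illustration), you can read the length off the leading bits of the first byte. It only looks at the bit pattern, so it doesn't reject lead bytes that strict UTF-8 forbids, such as 0xC0 or 0xF5:

```java
// Illustrative helper, not part of any library.
final class Utf8LeadByte {

    // Returns the sequence length implied by the first byte of a
    // UTF-8 sequence, or -1 if the byte cannot start a sequence.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;               // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;                         // 10xxxxxx (continuation) or invalid
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x7F)); // 1 (01111111)
        System.out.println(sequenceLength((byte) 0xDF)); // 2 (11011111)
        System.out.println(sequenceLength((byte) 0xC2)); // 2 (first byte of U+0085)
        System.out.println(sequenceLength((byte) 0x85)); // -1 (continuation byte)
    }
}
```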

There's also this question, which has some good answers as well.

