
My question arises from this answer, which says:

> Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.

If that's correct, then why is `"ℤ".getBytes(StandardCharsets.UTF_8).length == 3` and `"ℤ".getBytes(StandardCharsets.UTF_16).length == 4`?
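For reference, a minimal, self-contained reproduction of the numbers above (the class name is mine):

    import java.nio.charset.StandardCharsets;

    public class Repro {
        public static void main(String[] args) {
            String z = "ℤ"; // U+2124, inside the Basic Multilingual Plane
            System.out.println(z.getBytes(StandardCharsets.UTF_8).length);  // prints 3
            System.out.println(z.getBytes(StandardCharsets.UTF_16).length); // prints 4
        }
    }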

Kevin Krumwiede
  • For UTF-8 you might want to [read a little more about it](https://en.wikipedia.org/wiki/UTF-8). Three bytes is the correct number. Why did you expect anything else? What *did* you expect? – Some programmer dude Oct 19 '16 at 06:50
  • 1
    That's the difference between Unicode and UTF-8. 0x2124 is only the 'sequence number' in the Unicode table. How it's encoded, often in UTF-8, takes up three bytes. – MC Emperor Oct 19 '16 at 06:50
  • Actually what is happening is that when calling getBytes it takes the string format and calculates the ASCII amount of chars... an internal strlen call from the C lib. So 3 bytes = 7*3 ≈ 21 ASCII bits, which means 21/8 = 2.625 bytes; so 4 bytes = 7*4 ≈ 28 ASCII bits, 28/8 = 3.5 bytes, and that's why it shows 3 bytes and 4 bytes: it is actually ceiling the length because of format size differences. ^.^ cool hey! lol – Dean Van Greunen Oct 19 '16 at 07:03
  • 4
    `getBytes(StandardCharsets.UTF_16)` generates a BOM (byte order mark), 0xFEFF, so the length is 4. `"ℤ".getBytes(StandardCharsets.UTF_16BE)` is 2 bytes; see the sketch after these comments. –  Oct 19 '16 at 07:11
  • Looking at the answers, I'd say this question has very little to do with Java, maybe remove the tag? – walen Oct 19 '16 at 08:18
  • @saka1029: same with `StandardCharsets.UTF_16LE` – Remy Lebeau Oct 20 '16 at 23:40
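The BOM point made in the comments is easy to verify; a small sketch (class name is mine) that dumps the UTF-16 bytes and compares the BOM-free variants:

    import java.nio.charset.StandardCharsets;

    public class BomDemo {
        public static void main(String[] args) {
            byte[] utf16 = "ℤ".getBytes(StandardCharsets.UTF_16);
            for (byte b : utf16) System.out.printf("%02X ", b & 0xFF); // FE FF 21 24
            System.out.println(); // BOM (2 bytes) + one UTF-16 code unit (2 bytes) = 4
            System.out.println("ℤ".getBytes(StandardCharsets.UTF_16BE).length); // 2, no BOM
            System.out.println("ℤ".getBytes(StandardCharsets.UTF_16LE).length); // 2, no BOM
        }
    }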

2 Answers


It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8, UTF-16).

0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of sequence numbers mapped to characters. Such a sequence number is called a code point, and it's usually written as a hexadecimal number (here U+2124).

How that number is encoded may take up more bytes than the raw code point value suggests.
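To make the distinction concrete, a quick sketch (class name is mine): the code point is one fixed number, while the byte count varies with the encoding you pick:

    import java.nio.charset.StandardCharsets;

    public class CodePointVsEncoding {
        public static void main(String[] args) {
            String z = "ℤ";
            System.out.printf("U+%04X%n", z.codePointAt(0));                   // U+2124, the code point
            System.out.println(z.getBytes(StandardCharsets.UTF_8).length);    // 3 bytes in UTF-8
            System.out.println(z.getBytes(StandardCharsets.UTF_16BE).length); // 2 bytes in UTF-16 (no BOM)
        }
    }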


A short calculation of the UTF-8 encoding of the given character:
To mark which bytes belong to the same character, UTF-8 uses a scheme where the first byte of a multi-byte sequence starts with a certain number (let's call it N) of 1 bits followed by a 0 bit; N is the number of bytes the character takes up. The remaining N − 1 bytes each start with the bits 10.

Hex 0x2124 = binary 10 0001 0010 0100 (14 significant bits)

According to the rules above, this converts to the following UTF-8 encoding:

    11100010 10000100 10100100    <-- Our UTF-8 encoded result
    ^   ^ ^  ^ ^      ^ ^
    AaaaBbDd CcDddddd CcDddddd    <-- Some notes, explained below
  • A is a run of 1 bits followed by a 0 bit; the number of 1 bits denotes the number of bytes belonging to this character (three 1s = three bytes).
  • B is zero padding; without it the total number of bits would not be divisible by 8.
  • C marks a continuation byte (each subsequent byte starts with 10).
  • D is the actual bits of our code point.

So indeed, the character ℤ takes up three bytes.
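If you want to double-check the bit fiddling above, here is a sketch (method and class names are mine) that hand-builds the 3-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx and compares it with what the library produces:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ManualUtf8 {
        // Hand-encode a code point in the range 0x0800..0xFFFF (the 3-byte UTF-8 form).
        static byte[] encode3(int cp) {
            return new byte[] {
                (byte) (0b1110_0000 | (cp >> 12)),         // 1110 + top 4 payload bits
                (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)), // 10 + middle 6 payload bits
                (byte) (0b1000_0000 | (cp & 0x3F))         // 10 + low 6 payload bits
            };
        }

        public static void main(String[] args) {
            byte[] manual  = encode3(0x2124); // ℤ
            byte[] library = "ℤ".getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(manual, library)); // true
            for (byte b : manual)
                System.out.println(Integer.toBinaryString(b & 0xFF));
            // 11100010, 10000100, 10100100 -- the three bytes derived above
        }
    }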

MC Emperor

Not all characters in the BMP are encoded using two bytes in UTF-8. Code points up to U+007F are encoded using a single byte, U+0080 through U+07FF using 2 bytes, U+0800 through U+FFFF (the rest of the BMP, including ℤ) using 3 bytes, and code points from U+10000 upwards (outside the BMP) using 4 bytes.

The full table can be found in the [Wikipedia article on UTF-8](https://en.wikipedia.org/wiki/UTF-8).
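A quick sketch (class name is mine) that probes those boundaries with `getBytes`:

    import java.nio.charset.StandardCharsets;

    public class Utf8Lengths {
        public static void main(String[] args) {
            // Code points just below/at each UTF-8 length boundary
            int[] codePoints = { 0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFD, 0x10000 };
            for (int cp : codePoints) {
                String s = new String(Character.toChars(cp));
                System.out.printf("U+%04X -> %d byte(s) in UTF-8%n",
                        cp, s.getBytes(StandardCharsets.UTF_8).length);
            }
        }
    }

This prints 1, 2, 2, 3, 3 and 4 bytes respectively; U+FFFD (the replacement character) stands in for the top of the BMP.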

Elias Mårtenson