
My question arises from this answer, which says:

> Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.

If that's correct, then why is `"ℤ".getBytes(StandardCharsets.UTF_8).length == 3` and `"ℤ".getBytes(StandardCharsets.UTF_16).length == 4`?
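For reference, a minimal, self-contained reproduction of the numbers above (the class name is mine):

    import java.nio.charset.StandardCharsets;

    public class Repro {
        public static void main(String[] args) {
            String z = "ℤ"; // U+2124, inside the Basic Multilingual Plane
            System.out.println(z.getBytes(StandardCharsets.UTF_8).length);  // prints 3
            System.out.println(z.getBytes(StandardCharsets.UTF_16).length); // prints 4
        }
    }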

Kevin Krumwiede
  • For UTF-8 you might want to [read a little more about it](https://en.wikipedia.org/wiki/UTF-8). Three bytes is the correct number. Why did you expect anything else? What *did* you expect? – Some programmer dude Oct 19 '16 at 06:50
  • 1
    That's the difference between Unicode and UTF-8. 0x2124 is only the 'sequence number' in the Unicode table. How it's encoded, often in UTF-8, takes up three bytes. – MC Emperor Oct 19 '16 at 06:50
  • Actually what is happening is that when calling getBytes it takes the string format and calculates the ASCII amount of chars... an internal strlen call from the C lib. So 3 bytes = 7*3 ≈ 21 ASCII bits, which means 21/8 = 2.625 bytes; so 4 bytes = 7*4 ≈ 28 ASCII bits, 28/8 = 3.5 bytes, and that's why it shows 3 bytes and 4 bytes: it is actually ceiling the length because of format size differences. ^.^ cool hey! lol – Dean Van Greunen Oct 19 '16 at 07:03
  • 4
    `getBytes(StandardCharsets.UTF_16)` generates a BOM (byte order mark), 0xFEFF, so the length is 4. `"ℤ".getBytes(StandardCharsets.UTF_16BE)` is 2 bytes; see the sketch after these comments. –  Oct 19 '16 at 07:11
  • Looking at the answers, I'd say this question has very little to do with Java, maybe remove the tag? – walen Oct 19 '16 at 08:18
  • @saka1029: same with `StandardCharsets.UTF_16LE` – Remy Lebeau Oct 20 '16 at 23:40
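The BOM point made in the comments is easy to verify; a small sketch (class name is mine) that dumps the UTF-16 bytes and compares the BOM-free variants:

    import java.nio.charset.StandardCharsets;

    public class BomDemo {
        public static void main(String[] args) {
            byte[] utf16 = "ℤ".getBytes(StandardCharsets.UTF_16);
            for (byte b : utf16) System.out.printf("%02X ", b & 0xFF); // FE FF 21 24
            System.out.println(); // BOM (2 bytes) + one UTF-16 code unit (2 bytes) = 4
            System.out.println("ℤ".getBytes(StandardCharsets.UTF_16BE).length); // 2, no BOM
            System.out.println("ℤ".getBytes(StandardCharsets.UTF_16LE).length); // 2, no BOM
        }
    }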

2 Answers


It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8, UTF-16).

0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of sequence numbers mapped to characters. Such a sequence number is called a code point, and it's usually written as a hexadecimal number (here U+2124).

How that number is encoded may take up more bytes than the raw code point value suggests.
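To make the distinction concrete, a quick sketch (class name is mine): the code point is one fixed number, while the byte count varies with the encoding you pick:

    import java.nio.charset.StandardCharsets;

    public class CodePointVsEncoding {
        public static void main(String[] args) {
            String z = "ℤ";
            System.out.printf("U+%04X%n", z.codePointAt(0));                   // U+2124, the code point
            System.out.println(z.getBytes(StandardCharsets.UTF_8).length);    // 3 bytes in UTF-8
            System.out.println(z.getBytes(StandardCharsets.UTF_16BE).length); // 2 bytes in UTF-16 (no BOM)
        }
    }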


A short calculation of the UTF-8 encoding of the given character:
To mark which bytes belong to the same character, UTF-8 uses a scheme where the first byte of a multi-byte sequence starts with a certain number (let's call it N) of 1 bits followed by a 0 bit; N is the number of bytes the character takes up. The remaining N − 1 bytes each start with the bits 10.

Hex 0x2124 = binary 10 0001 0010 0100 (14 significant bits)

According to the rules above, this converts to the following UTF-8 encoding:

    11100010 10000100 10100100    <-- Our UTF-8 encoded result
    ^   ^ ^  ^ ^      ^ ^
    AaaaBbDd CcDddddd CcDddddd    <-- Some notes, explained below
  • A is a run of 1 bits followed by a 0 bit; the number of 1 bits denotes the number of bytes belonging to this character (three 1s = three bytes).
  • B is zero padding; without it the total number of bits would not be divisible by 8.
  • C marks a continuation byte (each subsequent byte starts with 10).
  • D is the actual bits of our code point.

So indeed, the character ℤ takes up three bytes.
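If you want to double-check the bit fiddling above, here is a sketch (method and class names are mine) that hand-builds the 3-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx and compares it with what the library produces:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ManualUtf8 {
        // Hand-encode a code point in the range 0x0800..0xFFFF (the 3-byte UTF-8 form).
        static byte[] encode3(int cp) {
            return new byte[] {
                (byte) (0b1110_0000 | (cp >> 12)),         // 1110 + top 4 payload bits
                (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)), // 10 + middle 6 payload bits
                (byte) (0b1000_0000 | (cp & 0x3F))         // 10 + low 6 payload bits
            };
        }

        public static void main(String[] args) {
            byte[] manual  = encode3(0x2124); // ℤ
            byte[] library = "ℤ".getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(manual, library)); // true
            for (byte b : manual)
                System.out.println(Integer.toBinaryString(b & 0xFF));
            // 11100010, 10000100, 10100100 -- the three bytes derived above
        }
    }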

MC Emperor

Not all characters in the BMP are encoded using two bytes in UTF-8. Code points up to U+007F are encoded using a single byte, U+0080 through U+07FF using 2 bytes, U+0800 through U+FFFF (the rest of the BMP, including ℤ) using 3 bytes, and code points from U+10000 upwards (outside the BMP) using 4 bytes.

The full table can be found in the [Wikipedia article on UTF-8](https://en.wikipedia.org/wiki/UTF-8).
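A quick sketch (class name is mine) that probes those boundaries with `getBytes`:

    import java.nio.charset.StandardCharsets;

    public class Utf8Lengths {
        public static void main(String[] args) {
            // Code points just below/at each UTF-8 length boundary
            int[] codePoints = { 0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFD, 0x10000 };
            for (int cp : codePoints) {
                String s = new String(Character.toChars(cp));
                System.out.printf("U+%04X -> %d byte(s) in UTF-8%n",
                        cp, s.getBytes(StandardCharsets.UTF_8).length);
            }
        }
    }

This prints 1, 2, 2, 3, 3 and 4 bytes respectively; U+FFFD (the replacement character) stands in for the top of the BMP.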

Elias Mårtenson