From what I understand, the main difference between UTF-16 and UTF-32 is that UTF-32 always uses four bytes per character, while UTF-16 sometimes uses one byte and sometimes two bytes per character. This gives UTF-16 the advantage of taking up less memory than UTF-32, but UTF-32 has the advantage of constant-time access to the nth character.
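To make concrete what I mean by constant-time access, here is a small Python sketch (the name `nth_char_utf32` is just something I made up for illustration): with a fixed-width encoding like UTF-32, the nth character sits at a fixed byte offset, so no scanning is needed.

```python
def nth_char_utf32(data: bytes, n: int) -> str:
    # UTF-32BE with no BOM: every character occupies exactly 4 bytes,
    # so character n lives at bytes [4*n, 4*n + 4), a constant-time lookup.
    return data[4 * n : 4 * n + 4].decode("utf-32-be")

encoded = "héllo".encode("utf-32-be")   # 5 characters -> 20 bytes
print(nth_char_utf32(encoded, 1))       # prints 'é'
```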
My question is: if you can represent every Unicode character with at most two bytes, as UTF-16 does, then why isn't there a format that always uses two bytes to encode each character? Such a format, while slightly more memory-expensive than UTF-16, would seemingly be strictly better than UTF-32, allowing the same constant-time access while using half the memory.
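A rough sketch of what I have in mind (the name `nth_char_fixed16` is made up, and the code only makes sense under my assumption above that two bytes per character are enough):

```python
def nth_char_fixed16(data: bytes, n: int) -> str:
    # Hypothetical fixed two-byte format: character n would sit at
    # bytes [2*n, 2*n + 2), so lookup would still be constant time.
    # Decoding as UTF-16BE here works only under my assumption that
    # every character fits in a single two-byte unit.
    return data[2 * n : 2 * n + 2].decode("utf-16-be")

encoded = "héllo".encode("utf-16-be")   # 2 bytes per character here
print(nth_char_fixed16(encoded, 4))     # prints 'o'
```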
What is my misunderstanding here?