Unicode maps each character* to an integer “code point”. Valid code points are U+0000 through U+10FFFF, allowing for more than a million characters (although most of these aren't assigned yet).
(* It's a bit more complicated than that, because there are “combining characters” where one user-perceived character can be represented by more than one code point. And some characters have both pre-composed and decomposed representations. For example, the Spanish letter ñ can be represented either as the single code point U+00F1, or as the sequence U+006E U+0303 (n + combining tilde).)
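You can see both representations with Python's standard unicodedata module (just a quick illustration; any language with a Unicode normalization library behaves similarly):

```python
import unicodedata

precomposed = "\u00F1"        # ñ as a single code point
decomposed  = "\u006E\u0303"  # n + combining tilde

print(precomposed == decomposed)                                 # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True: NFC composes to U+00F1
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True: NFD decomposes to U+006E U+0303
```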
There are three different encoding forms (not counting offbeat ones like UTF-9 and UTF-18) that can be used to represent Unicode characters in a string.
UTF-32 is the most straightforward one: Each code point is represented by a 32-bit integer. So, for example:
- A (U+0041) = 0x00000041
- ñ (U+00F1) = 0x000000F1
- ४ (U+096A) = 0x0000096A
- 💪 (U+1F4AA) = 0x0001F4AA
While simple, UTF-32 uses a lot of memory (4 bytes for every character), and is rarely used.
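As a quick sketch, you can reproduce these values in Python with the standard utf-32-be codec (big-endian, so the bytes print in the same order as the hex above):

```python
# Each code point becomes exactly one 32-bit (4-byte) unit.
for ch in ["A", "ñ", "४", "💪"]:
    encoded = ch.encode("utf-32-be")   # big-endian, no byte-order mark
    print(f"U+{ord(ch):04X} = 0x{encoded.hex().upper()}")
```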
UTF-16 uses 16-bit code units. Characters U+0000 through U+FFFF (the “Basic Multilingual Plane”) are represented straightforwardly as a single code unit, while characters U+10000 through U+10FFFF are represented as a “surrogate pair”. Specifically, you subtract 0x10000 from the code point (resulting in a 20-bit number) and use those bits to fill out the binary sequence 110110xxxxxxxxxx 110111xxxxxxxxxx: the high 10 bits go into the first code unit and the low 10 bits into the second. For example:
- A (U+0041) = 0x0041
- ñ (U+00F1) = 0x00F1
- ४ (U+096A) = 0x096A
- 💪 (U+1F4AA) = 0xD83D 0xDCAA
In order for this system to work, the code points U+D800 through U+DFFF are permanently reserved for this UTF-16 surrogate mechanism and will never be assigned to “real” characters.
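Here's a minimal sketch of that surrogate-pair math in Python (the function name is just for illustration, and it doesn't validate its input):

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Encode one code point as a list of UTF-16 code units (no input validation)."""
    if code_point < 0x10000:
        return [code_point]               # BMP character: a single 16-bit unit
    v = code_point - 0x10000              # 20-bit value
    high = 0xD800 | (v >> 10)             # 110110xxxxxxxxxx (top 10 bits)
    low  = 0xDC00 | (v & 0x3FF)           # 110111xxxxxxxxxx (bottom 10 bits)
    return [high, low]

print(" ".join(f"0x{u:04X}" for u in utf16_code_units(0x1F4AA)))  # 0xD83D 0xDCAA
```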
It's a backwards-compatibility “hack” that allows the full 17-“plane” Unicode code space to be represented on 1990s-era platforms designed with the expectation that Unicode characters would always be 16 bits wide. Those platforms include Windows NT, Java, and JavaScript.
UTF-8 represents Unicode code points with sequences of 1-4 bytes. Specifically, each code point is encoded using the shortest of the following forms that has enough x bits to hold it:
- 0xxxxxxx
- 110xxxxx 10xxxxxx
- 1110xxxx 10xxxxxx 10xxxxxx
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
So, with the examples from earlier:
- A (U+0041) = 0x41
- ñ (U+00F1) = 0xC3 0xB1
- ४ (U+096A) = 0xE0 0xA5 0xAA
- 💪 (U+1F4AA) = 0xF0 0x9F 0x92 0xAA
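A minimal UTF-8 encoder is just a matter of picking the right pattern and filling in the bits. Here's a sketch in Python (illustrative only; it doesn't reject surrogates or out-of-range values):

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one code point as UTF-8 using the shortest form (no input validation)."""
    if code_point < 0x80:                     # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                    # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for cp in [0x41, 0xF1, 0x96A, 0x1F4AA]:
    print(f"U+{cp:04X} = " + " ".join(f"0x{b:02X}" for b in utf8_bytes(cp)))
```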
This encoding has the property that the number of bytes in the sequence can be determined from the value of the first byte. Furthermore, leading bytes can be easily distinguished from continuation bytes:
- 0xxxxxxx = single-byte character (ASCII-compatible)
- 10xxxxxx = continuation byte of 2-, 3-, or 4-byte character
- 110xxxxx = lead byte of 2-byte character
- 1110xxxx = lead byte of 3-byte character
- 11110xxx = lead byte of 4-byte character
- 11111xxx = not used
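That property is what makes it cheap to find the length of a sequence (or resynchronize in the middle of a byte stream) by looking at a single byte. A rough sketch of that check (the function name is just for illustration):

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Length of the UTF-8 sequence starting with this byte (0 if it isn't a valid lead byte)."""
    if lead_byte < 0x80:    # 0xxxxxxx: single-byte (ASCII) character
        return 1
    if lead_byte < 0xC0:    # 10xxxxxx: continuation byte, not a lead byte
        return 0
    if lead_byte < 0xE0:    # 110xxxxx
        return 2
    if lead_byte < 0xF0:    # 1110xxxx
        return 3
    if lead_byte < 0xF8:    # 11110xxx
        return 4
    return 0                # 11111xxx: never used in UTF-8

print(utf8_sequence_length(0xF0))  # 4
```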