
I've seen that Unicode code points that don't fit in 2 bytes, like U+10000, can be written as a pair of escapes, like \uD800\uDC00. They seem to start with the nibble D, but that's all I've noticed.

What is that splitting action called and how does it work?

user193661
  • It's called surrogate pairs. See: https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx – Shannon Severance Nov 11 '15 at 00:45
  • The involvement of UTF16 is what's confusing me. Because I was thinking that UTFs just convert codepoints to bytestrings. – user193661 Nov 11 '15 at 00:47
  • A UTF defines how codepoints are encoded as codeunits, which can be 7-bit, 8-bit, 16-bit, or 32-bit in size, depending on the UTF. UTF-8 encodes a codepoint using 1, 2, 3, or 4 8-bit codeunits, depending on the codepoint's value. UTF-16 encodes a codepoint using 1 or 2 16-bit codeunits, where 2 codeunits acting together is known as a surrogate pair. Each UTF defines its own algorithm for converting a codepoint into a sequence of codeunits, and vice versa. Read the UTF specs. – Remy Lebeau Nov 11 '15 at 00:56
  • In a nutshell, for UTF-16, if a codepoint is > 0xFFFF, 0x010000 is subtracted from it, leaving a 20-bit value. That value is divided in half: the high 10 bits are added to 0xD800 and the low 10 bits are added to 0xDC00, thus creating a surrogate pair. (A worked example of this formula follows these comments.) – Remy Lebeau Nov 11 '15 at 01:05
  • Unicode started out as a 16-bit standard, through version 3. When it was decided that more bits were needed, a set of code points was set aside to encode higher code points so that systems built around the sixteen-bit version would still "work" without major retrofitting. The encoding of a code point into a surrogate pair and then into UTF-16 mirrors the direct encoding of a code point into UTF-16. Note UTF-8 used to support supplementary code points encoded as a surrogate pair, for a total of 6 bytes; that is no longer supported. See CESU-8, http://www.unicode.org/reports/tr26/ – Shannon Severance Nov 11 '15 at 01:09
  • @ShannonSeverance: The "sixteen bit version" you refer to was known as UCS-2, not UTF-16. UTF-16 is backwards compatible with UCS-2, but adds surrogate pairs for encoding codepoints that do not fit in UCS-2. – Remy Lebeau Nov 11 '15 at 01:13
  • You're mistaking *code unit* for *code point*. In UTF-16 some code points are represented by 2 *code units*. There's no character code point in that area. [Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html). [What do I need to know about Unicode?](http://stackoverflow.com/q/222386/995714) – phuclv Nov 11 '15 at 03:04
  • @LưuVĩnhPhúc So that surrogate pair I provided is a UTF16 escape sequence representing an abstraction? – user193661 Nov 11 '15 at 03:19
  • Possible duplicate of [Manually converting unicode codepoints into UTF-8 and UTF-16](http://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16) – Mark Tolonen Nov 11 '15 at 05:44
  • @Clearquestionwithexamples: a surrogate pair is two 16-bit code units working together. The pair you showed in your question is using escape sequences to represent the numeric values of the individual code units. – Remy Lebeau Nov 13 '15 at 02:25
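
A worked example of the formula from the comments (a quick sketch in Python; the variable names are mine, and the built-in UTF-16 codec is only used as a cross-check):

    # Encode U+10000 as a surrogate pair using the formula quoted above.
    cp = 0x10000
    v = cp - 0x10000              # subtract 0x10000, leaving a 20-bit value (here: 0)
    high = 0xD800 + (v >> 10)     # high 10 bits added to 0xD800 -> 0xD800
    low = 0xDC00 + (v & 0x3FF)    # low 10 bits added to 0xDC00 -> 0xDC00

    print(hex(high), hex(low))                # 0xd800 0xdc00
    print(chr(cp).encode('utf-16-be').hex())  # d800dc00 -- the same pair as raw bytes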

2 Answers


UTF-8 means (using my own words) that the minimum atom of processing is a byte (the code unit is 1 byte long). I don't know the history for certain, but at least conceptually speaking, the UCS-2 and UCS-4 Unicode encodings came first, and UTF-8/UTF-16 appeared later to solve some problems of UCS-*.

UCS-2 means that each character uses 2 bytes instead of one. It's a fixed-length encoding: UCS-2 stores the bit string of each code point directly, as you say. The problem is that there are characters whose code points require more than 2 bytes to store. So UCS-2 can only handle a subset of Unicode (the range U+0000 to U+FFFF, of course).
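
A quick way to see that 2-byte ceiling (a Python sketch; struct is only used here to pack a raw 16-bit value, it is not a UCS-2 codec):

    import struct

    struct.pack('>H', 0xFFFF)   # b'\xff\xff' -- U+FFFF still fits in one 16-bit unit
    struct.pack('>H', 0x10000)  # raises struct.error: the value no longer fits in 2 bytes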

UCS-4 uses 4 bytes for each character instead, and it is obviously capable of storing the bit string of any Unicode code point (the Unicode range is U+0000 to U+10FFFF).

The problem with UCS-4 is that characters outside the 2-byte range are very, very uncommon, so any text encoded with UCS-4 wastes a lot of space. So using UCS-2 is a better approach, unless you need characters outside the 2-byte range.

But again, English texts, source code files and so on mostly use ASCII characters, and UCS-2 has the same problem: it wastes too much space for texts that are mostly ASCII (too many useless zero bytes).
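
A rough comparison of the space used by a short ASCII-only string (a Python sketch; the -be codecs are used so no byte-order mark is added, and UTF-16/UTF-32 behave like UCS-2/UCS-4 for these characters):

    s = 'hello'                  # 5 ASCII characters
    len(s.encode('ascii'))       # 5 bytes
    len(s.encode('utf-8'))       # 5 bytes
    len(s.encode('utf-16-be'))   # 10 bytes -- every other byte is zero
    len(s.encode('utf-32-be'))   # 20 bytes -- three zero bytes per character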

That is what UTF-8 addresses. Characters inside the ASCII range are saved in UTF-8 texts as-is, using just the bit string of each character's code point/ASCII value. So, if a UTF-8 encoded text uses only ASCII characters, it is byte-for-byte identical to the same text in ASCII or Latin-1. Clients without UTF-8 support can handle UTF-8 texts that use only ASCII characters, because they look identical. It's a backward-compatible encoding.
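
That backward compatibility is easy to check (a one-line Python sketch):

    text = 'plain ASCII text'
    text.encode('utf-8') == text.encode('ascii') == text.encode('latin-1')  # True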

From then on (Unicode characters outside the ASCII range), UTF-8 texts use two, three or four bytes to save code points, depending on the character.

I don't know the exact method offhand, but the bit string is split across two, three or four bytes using known bit prefixes that tell you how many bytes were used to store the code point. If a byte begins with 0, the character is ASCII and uses only 1 byte (the ASCII range is 7 bits long). If it begins with 1, the character is encoded using two, three or four bytes, depending on the bits that follow.
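
The exact method hinted at above can be sketched like this (my own helper, assuming the standard UTF-8 prefixes 0xxxxxxx, 110xxxxx, 1110xxxx and 11110xxx, with 10xxxxxx continuation bytes; it skips validation such as rejecting surrogates):

    def utf8_encode(cp):
        """Encode a single code point as UTF-8 bytes (illustrative sketch)."""
        if cp < 0x80:                    # 7 bits  -> 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                   # 11 bits -> 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:                 # 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        # 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

    utf8_encode(0x41) == 'A'.encode('utf-8')               # True, 1 byte
    utf8_encode(0x3B1) == chr(0x3B1).encode('utf-8')       # True, 2 bytes (Greek alpha)
    utf8_encode(0x10000) == chr(0x10000).encode('utf-8')   # True, 4 bytes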

The problem with UTF-8 is that it requires more processing (it must examine the leading bits of each character to know its length), especially if the text is not English-like. For example, a text written in Greek will use mostly two-byte characters.

UTF-16 uses two-byte code units to solve that problem for non-ASCII texts. That means the atom of processing is a 16-bit word. If a character doesn't fit in one two-byte code unit, it uses 2 code units (four bytes) to encode the character. That pair of code units is called a surrogate pair. I think a UTF-16 text using only characters inside the 2-byte range is byte-identical to the same text encoded as UCS-2.
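
That rule can be written down directly (my own helper, not a library API; it matches the formula quoted in the comments under the question):

    def utf16_code_units(cp):
        """Return the UTF-16 code unit(s) for one code point (illustrative sketch)."""
        if cp < 0x10000:
            return [cp]                   # BMP: one 16-bit code unit, same as UCS-2
        v = cp - 0x10000                  # 20 bits left
        return [0xD800 + (v >> 10),       # high surrogate
                0xDC00 + (v & 0x3FF)]     # low surrogate

    [hex(u) for u in utf16_code_units(0x20AC)]   # ['0x20ac']           (one unit)
    [hex(u) for u in utf16_code_units(0x10000)]  # ['0xd800', '0xdc00'] (the pair from the question)
    [hex(u) for u in utf16_code_units(0x1F600)]  # ['0xd83d', '0xde00'] (an emoji)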

UTF-32, in turn, uses 4-byte code units, as UCS-4 does. I don't know the differences between them, though.

ABu
  • I believe you said you're confused about how variable width encoding works. I found [this](http://stackoverflow.com/questions/1543613/how-does-utf-8-variable-width-encoding-work) helpful. – user193661 Nov 11 '15 at 01:38
  • UTF-8 needs at most 4 bytes to store a 21-bit code point, as one code unit stores 6 bits of data – phuclv Nov 11 '15 at 03:05
  • The biggest problem with UCS-4 is not the memory requirements, rather it is the fact that Unicode didn't need that many bits when the major platforms standardized their Unicode implementations. – Mark Ransom Nov 11 '15 at 03:56
  • Another problem which UTF-8 solves elegantly is that you can tell from every individual byte where in a multi-byte sequence it belongs; with legacy multibyte encodings, you could get out of whack and/or have to scan several surrounding bytes to figure out how to interpret the current byte. – tripleee Dec 15 '20 at 08:06

The complete picture, filling in the confusion from the question, is laid out below:

Referencing what I learned from the comments...


U+10000 is a Unicode code point (hexadecimal integer mapped to a character).

Unicode is a one-to-one mapping of code points to characters.

The inclusive range of code points from 0xD800 to 0xDFFF is reserved for UTF-16 [1] (Unicode vs UTF) surrogate units (see below).

\uD800\uDC00 [2] are two such surrogate units, called a surrogate pair. (A surrogate unit is a code unit that's part of a surrogate pair.)

Abstract representation: Code point (abstract character) --> Code unit (abstract UTF-16) --> Code unit (UTF-16 encoded bytes) --> Interpreted UTF-16

Actual usage example: Input data is bytes and may be wrapped in a second encoding, like ASCII for HTML entities and unicode escapes, or anything the parser handles --> Encoding interpreted; mapped to code point via scheme --> Font glyph --> Character on screen
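
As a concrete instance of that pipeline, JSON uses exactly the \uXXXX escapes from the question, and its parser recombines a surrogate pair into a single code point (a Python sketch using the standard json module):

    import json

    decoded = json.loads('"\\uD800\\uDC00"')  # the escaped surrogate pair from the question
    len(decoded)        # 1 -- one character
    hex(ord(decoded))   # '0x10000' -- the code point U+10000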

How surrogate pairs work


Surrogate pair advantages:

  1. There are only high and low surrogate units: a high must be followed by a low, so the two kinds cannot be confused with each other.
  2. UTF-16 can use 2 bytes for 63,488 code points (U+0000 through U+FFFF, minus the 2,048 surrogates), because surrogate units cannot be mistaken for ordinary code points.
  3. The 2,048 reserved code points split into 1,024 high and 1,024 low surrogates, giving (2048/2)**2 = 1,048,576 possible pairs, exactly the size of the supplementary range (this arithmetic is checked in the sketch after the footnotes).
  4. The extra processing cost falls on the less frequently used characters.

[1] UTF-16 is the only UTF which uses surrogate pairs.
[2] This is formatted as a Unicode escape sequence.
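
The arithmetic in points 2 and 3, and the reverse mapping from a surrogate pair back to a code point, can be checked with a short sketch (my own helper, not a library function):

    def surrogate_pair_to_codepoint(high, low):
        """Combine a high/low surrogate pair into one code point (illustrative sketch)."""
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    hex(surrogate_pair_to_codepoint(0xD800, 0xDC00))  # '0x10000', the question's example

    # Point 2: BMP code points encodable in one 16-bit unit = 65,536 minus the 2,048 surrogates.
    0x10000 - 0x800          # 63488
    # Point 3: 1,024 high x 1,024 low surrogates cover every supplementary code point.
    (2048 // 2) ** 2         # 1048576
    0x10FFFF - 0x10000 + 1   # 1048576 -- U+10000 through U+10FFFF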


Graphics describing character encoding: (two images, not reproduced here)



user193661