I am learning about UTF-16 encoding, and I have read that if you want to represent code points in the range U+10000 to U+10FFFF, you have to use surrogate pairs, whose two code units lie in the range 0xD800 to 0xDFFF.
So let's say I want to encode the code point U+10123 (10000000100100011 in binary).
First I lay out this sequence of bits:
110110xxxxxxxxxx 110111xxxxxxxxxx
Then I subtract 0x10000 from the code point and fill the x places with the resulting 20-bit value: 0x10123 - 0x10000 = 0x00123, which is 00000000000100100011 in binary, so I get:
1101100000000000 1101110100100011 (D800 DD23 in hexadecimal)
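To double-check my math, I wrote this small Python sketch of the procedure (the function name and layout are my own, not from any library):

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000     # 20-bit value after subtracting 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low

print([hex(unit) for unit in to_surrogate_pair(0x10123)])  # ['0xd800', '0xdd23']
```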
I have also read that the code points in the range U+D800 to U+DFFF were removed from the Unicode character set (they are permanently reserved, so no character will ever be assigned to them), but I don't understand why this range had to be sacrificed!
I mean, this range could seemingly be encoded in 4 bytes as well. For example, if I skip the 0x10000 offset (U+D812 is below U+10000, so there is nothing to subtract anyway) and fill the x places directly with the code point U+D812 (1101100000010010 in binary), I get:
1101100000110110 1101110000010010 (D836 DC12 in hexadecimal)
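Real encoders refuse to do this, though. For instance, CPython's built-in UTF-16 codec (just one concrete example) raises an error on a lone surrogate:

```python
# A lone surrogate is not a valid Unicode scalar value, so the
# built-in UTF-16 codec refuses to encode it.
try:
    "\ud812".encode("utf-16-be")
except UnicodeEncodeError as exc:
    print(exc)  # ... can't encode character '\ud812' ...: surrogates not allowed
```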
Note: I was using UTF-16 Big Endian in my examples.
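Here is how I verified the byte order (again with CPython's built-in codec; UTF-16 Little Endian is shown only for contrast):

```python
# The same surrogate pair, serialized with each byte order.
print("\U00010123".encode("utf-16-be").hex())  # d800dd23 (big-endian, as above)
print("\U00010123".encode("utf-16-le").hex())  # 00d823dd (little-endian)
```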