UTF16 BIG ENDIAN to UTF8 conversion failed for 0xdcf0

Question

I am trying to convert UTF-16 to UTF-8. For the input 0xdcf0, the conversion fails with an "invalid multibyte sequence" error, and I don't understand why. The library I am using for the UTF-16 to UTF-8 conversion contains this check:

/* Parentheses are required: in C, == binds tighter than &. */
if ((first_byte & 0xfc) == 0xdc) {
   return -1;
}

Can you please help me understand why this check is present?

Possible duplicate of [Python 2.7: Strange Unicode behavior](https://stackoverflow.com/questions/53140775/python-2-7-strange-unicode-behavior) — phuclv, Aug 18 '19 at 05:43
[Is it possible to construct a unicode string that the utf-8 codec cannot encode?](https://stackoverflow.com/q/41231414/995714), [How to support surrogate characters in utf8](https://stackoverflow.com/q/42556605/995714), [What are surrogate characters in UTF-8?](https://stackoverflow.com/q/51001150/995714) — phuclv, Aug 18 '19 at 05:44

score 2 · Answer 1 · answered Aug 18 '19 at 02:21

Unicode characters in the DC00–DFFF range are "low" surrogates, i.e. are used in UTF-16 as the second part of a surrogate pair, the first part being a "high" surrogate character in the range D800–DBFF.

See, e.g., the Wikipedia article on UTF-16 for more information.

The reason you cannot convert to UTF-8 is that you only have half a Unicode code point.
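
To make the ranges above concrete, here is a minimal sketch in C that classifies a single 16-bit UTF-16 code unit. The function names are made up for illustration and are not taken from any particular library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for a "high" (leading) surrogate, D800..DBFF. */
bool is_high_surrogate(uint16_t unit) {
    return unit >= 0xD800 && unit <= 0xDBFF;
}

/* Hypothetical helper: true for a "low" (trailing) surrogate, DC00..DFFF. */
bool is_low_surrogate(uint16_t unit) {
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

/* A code unit is a complete code point on its own only if it is
 * not a surrogate of either kind. */
bool is_complete_code_point(uint16_t unit) {
    return !is_high_surrogate(unit) && !is_low_surrogate(unit);
}
```

With these helpers, 0xDCF0 is reported as a low surrogate and therefore not a complete code point, which is why a converter has nothing valid to encode.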

score 1 · Answer 2 · edited Aug 18 '19 at 03:37

In UTF-16, the two byte sequence

DCF0

cannot begin the encoding of any character at all.

The way UTF-16 works is that some characters are encoded in 2 bytes and some characters are encoded in 4 bytes. The characters that are encoded with two bytes use 16-bit sequences in the ranges:

0000 .. D7FF
E000 .. FFFF

All other characters require four bytes to be encoded in UTF-16. For these characters the first pair of bytes must be in the range

D800 .. DBFF

and the second pair of bytes must be in the range

DC00 .. DFFF

This is how the encoding scheme is defined. See the Wikipedia page for UTF-16.
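
As a rough illustration of the four-byte case, the sketch below combines a high and a low surrogate into one code point using the standard UTF-16 arithmetic. The function name `combine_surrogates` is hypothetical, and returning -1 on bad input is just one possible convention.

```c
#include <stdint.h>

/* Hypothetical helper: combine a surrogate pair into one code point.
 * Returns -1 unless `high` is in D800..DBFF and `low` is in DC00..DFFF. */
int32_t combine_surrogates(uint16_t high, uint16_t low) {
    if (high < 0xD800 || high > 0xDBFF) {
        return -1;
    }
    if (low < 0xDC00 || low > 0xDFFF) {
        return -1;
    }
    /* Each surrogate carries 10 bits; the result is offset by 0x10000. */
    return 0x10000 + ((int32_t)(high - 0xD800) << 10) + (int32_t)(low - 0xDC00);
}
```

For example, the pair D83D DE00 yields U+1F600, while passing DCF0 as the first unit is rejected because it is not a high surrogate.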

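Notice that the FIRST sixteen bits of an encoding of a character can NEVER be in DC00 through DFFF. It is simply not allowed in UTF-16. This is (if you follow the bitwise arithmetic in the code you found) exactly what is being checked for: in big-endian input, masking the first byte with 0xfc and comparing the result to 0xdc is true exactly when that byte is DC, DD, DE, or DF.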
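
To show where such a check could sit in a converter, here is a minimal sketch that converts a single character from UTF-16BE to UTF-8. It is not the code of the library in the question; the function name, signature, and the -1 error convention are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: convert one character from UTF-16BE to UTF-8.
 * Reads from in[0..in_len) and writes at most 4 bytes to out.
 * Returns the number of UTF-8 bytes written, or -1 on invalid input. */
int utf16be_char_to_utf8(const unsigned char *in, size_t in_len,
                         unsigned char *out) {
    if (in_len < 2) {
        return -1;
    }
    /* First byte DC..DF: a low surrogate cannot start a character. */
    if ((in[0] & 0xfc) == 0xdc) {
        return -1;
    }
    uint32_t cp = ((uint32_t)in[0] << 8) | in[1];

    /* First byte D8..DB: a high surrogate must be followed by a low one. */
    if ((in[0] & 0xfc) == 0xd8) {
        if (in_len < 4 || (in[2] & 0xfc) != 0xdc) {
            return -1;
        }
        uint32_t low = ((uint32_t)in[2] << 8) | in[3];
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
    }

    /* Encode the code point as 1 to 4 UTF-8 bytes. */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Given the two bytes DC F0, this sketch returns -1 at the first surrogate check, which mirrors the failure described in the question. A well-formed pair such as D8 3D DE 00 goes through the high-surrogate branch and produces four UTF-8 bytes.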

Thanks for the edit, @JonathanLeffler! Wondering also if one could use the term "hextet"...it doesn't seem widely used outside of talking about IPv6 but does apply here too. — Ray Toal, Aug 18 '19 at 04:29
