What are the rules for reading a UTF-16 byte stream, to determine how many bytes a character takes up? I've read the standard, but based on empirical observations of real-world UTF-16 encoded streams, it looks like there are certain cases where the standard doesn't hold true (or there's an aspect of the standard that I'm missing).
From reading the UTF-16 standard (RFC 2781, https://www.rfc-editor.org/rfc/rfc2781):
| Value of leading 2 bytes | Resulting character length (bytes) |
|---|---|
| 0x0000-0xD7FF | 2 |
| 0xD800-0xDBFF | 4 |
| 0xDC00-0xDFFF | Invalid sequence (RFC 2781, 2.2.2) |
| 0xE000-0xFFFF | 2 |
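To make sure I'm reading those rules correctly, here is a rough Python 3 sketch of how I understand the leading-unit check (my own illustration, not production code):

```python
def utf16be_char_length(lead: bytes) -> int:
    """How many bytes the character starting with these two big-endian
    bytes should occupy, per my reading of RFC 2781."""
    unit = int.from_bytes(lead[:2], "big")
    if 0xD800 <= unit <= 0xDBFF:
        return 4   # leading (high) surrogate: a trailing surrogate must follow
    if 0xDC00 <= unit <= 0xDFFF:
        raise ValueError("trailing surrogate with no leading surrogate (RFC 2781, 2.2.2)")
    return 2       # any other value is a complete 2-byte code unit

# 0xD83D is a leading surrogate, so a 4-byte character is expected:
assert utf16be_char_length(b"\xD8\x3D") == 4
# 0x0041 (LATIN CAPITAL LETTER A) stands alone as 2 bytes:
assert utf16be_char_length(b"\x00\x41") == 2
```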
In practice, this appears to hold true, for some cases at least. I checked it using an ad-hoc SQL script (SQL Server 2019, UTF-16 collation), and also verified the results with an online decoder:
| Character | Unicode Name | ISO 10646 | UTF-16 Encoding (hexadecimal, big endian) | Size (bytes) |
|---|---|---|---|---|
| A | LATIN CAPITAL LETTER A | U+0041 | 00 41 | 2 |
| Б | CYRILLIC CAPITAL LETTER BE | U+0411 | 04 11 | 2 |
| ァ | KATAKANA LETTER SMALL A | U+30A1 | 30 A1 | 2 |
| 🐰 | RABBIT FACE | U+1F430 | D8 3D DC 30 | 4 |
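For what it's worth, the same byte sequences can be reproduced outside SQL Server; this Python 3 check (my own, purely to confirm the table; `bytes.hex(sep)` needs Python 3.8+) prints matching results:

```python
# Cross-check of the table above using Python 3's built-in UTF-16 codec.
for ch in ["A", "Б", "ァ", "\U0001F430"]:        # last one is RABBIT FACE
    encoded = ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X}  {encoded.hex(' ').upper()}  {len(encoded)} bytes")

# Output:
# U+0041  00 41  2 bytes
# U+0411  04 11  2 bytes
# U+30A1  30 A1  2 bytes
# U+1F430  D8 3D DC 30  4 bytes
```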
However, when encoding the following ISO 10646 character into UTF-16, it takes up 4 bytes, yet reading the leading 2 bytes gives no indication that it will be this long:
| Character | Unicode Name | UTF-16 Encoding (hexadecimal, big endian) | Size (bytes) |
|---|---|---|---|
| ⚕️ | STAFF OF AESCULAPIUS | 26 95 FE 0F | 4 |
Whilst I'd rather keep my question software-agnostic, the following SQL will reproduce this behaviour on Microsoft SQL Server 2019, with the default collation and default language. (Note that SQL Server stores the bytes little-endian.)
```sql
select cast(N'⚕️' as varbinary);
----------
0x95260FFE
```
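For a more software-agnostic reproduction, this Python 3 snippet (again my own check, pasting the same character) shows the same bytes in big-endian order:

```python
# Reproduction of the same observation without SQL Server (Python 3.8+).
s = "⚕️"                                   # pasted from the table above
encoded = s.encode("utf-16-be")
print(encoded.hex(" ").upper())             # 26 95 FE 0F
print(len(encoded), "bytes")                # 4 bytes
```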
Quite simply, how/why would you read 0x2695 and think "I'll need to read in the next word for this character"? Why doesn't this appear to align with the published UTF-16 standard?