According to https://en.wikipedia.org/wiki/UTF-8, the first byte of the encoding of a character never starts with either the bit pattern 10xxxxxx or 11111xxx. The reason for the first is obvious: auto-synchronization. But what about the second? Is it reserved for something like a potential extension to 5-byte encodings?
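For example, this is how I understand the auto-synchronization part to work (a minimal sketch of my own, not from the linked article): a decoder dropped into the middle of a stream can skip continuation bytes, all of which match 10xxxxxx, to find the next character boundary.

```python
def resync(data: bytes, pos: int) -> int:
    """Find the index of the next character boundary at or after pos.

    Continuation bytes all match 10xxxxxx, so any byte that does NOT
    match that pattern must start a new character (or be an error byte).
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# Jump into the middle of a multi-byte character and recover:
data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
print(resync(data, 2))           # 3 -> the 'l' after the 2-byte 'é'
```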
1 Answer
Older versions of UTF-8 allowed up to 6-byte encodings. It was later restricted to 4-byte encodings, but there's no reason to make the format inconsistent in order to achieve that restriction. The number of leading 1s indicates the length of the sequence, so 11111xxx still means "at least 5 bytes"; there just are no legal sequences of that length.
Having illegal byte sequences is very useful for detecting corruption (or, more commonly, attempts to decode data that is not actually UTF-8). So making the format inconsistent just to get back one bit of storage (which couldn't actually be used for anything) would hurt other goals.
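Here's a rough sketch (in Python, just for illustration, not a full validator) of how a decoder reads the sequence length straight out of the lead byte; the 11111xxx patterns simply never map to a legal length:

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte,
    or 0 if the byte can never start a character."""
    if lead < 0x80:    # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead < 0xC0:    # 10xxxxxx: continuation byte, never a start
        return 0
    if lead < 0xE0:    # 110xxxxx: 2-byte sequence
        return 2
    if lead < 0xF0:    # 1110xxxx: 3-byte sequence
        return 3
    if lead < 0xF8:    # 11110xxx: 4-byte sequence
        return 4
    return 0           # 11111xxx: would mean 5+ bytes; not legal
    # (Modern UTF-8 also rejects 0xF5-0xF7, because they'd encode code
    # points above U+10FFFF; this sketch only maps the bit patterns.)


print(sequence_length(0x41))   # 1  ('A')
print(sequence_length(0xE2))   # 3  (e.g. lead byte of '€')
print(sequence_length(0xF9))   # 0  (11111xxx, illegal)
```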

Rob Napier
- To be precise, Unicode's code point range was restricted to U+10FFFF in order to guarantee that all characters could be encoded [in UTF-16](https://stackoverflow.com/a/280182/287586). This had the side effect of making the bytes 0xF5-0xFD unused in UTF-8. (0xFE and 0xFF were *never* valid UTF-8 lead bytes, which guarantees that a UTF-8 character will never be confused with UTF-16's byte-order mark.) The other illegal bytes in UTF-8 are 0xC0 and 0xC1, which would only occur in "overlong" encodings of ASCII characters. – dan04 Feb 22 '19 at 16:42
- @dan04 Thanks! Can you elaborate more on "overlong" encodings of ASCII chars? – Junekey Jeon Feb 22 '19 at 16:53
- See https://stackoverflow.com/questions/7113117/what-is-exactly-an-overlong-form-encoding for a quick explanation of "overlong" encodings. Basically, it's possible to encode a value with more bytes than required by adding leading zeros, but that's explicitly forbidden (a concrete example follows below). – Rob Napier Feb 22 '19 at 18:46
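To make the overlong point concrete, here is a small illustration (not from the thread above): ASCII 'A' is U+0041, so its only valid UTF-8 encoding is the single byte 0x41. The two-byte pattern 110xxxxx 10xxxxxx could mathematically hold the same value as 0xC1 0x81, but a conforming decoder rejects that form:

```python
# 'A' is U+0041: the only valid UTF-8 encoding is the single byte 0x41.
print(b"\x41".decode("utf-8"))        # A

# 0xC1 0x81 packs the same value into the 2-byte pattern 110xxxxx 10xxxxxx,
# but that is an overlong form, so the decoder refuses it.
try:
    b"\xc1\x81".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)                          # invalid start byte 0xc1
```

This is also why 0xC0 and 0xC1 can never appear in valid UTF-8: any 2-byte sequence starting with them would encode a value that fits in a single byte.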