As an analogy, suppose you want to write a long text onto multiple pages, and you need to know exactly when the text ends. Then you would probably reserve a small part of the lower right corner for a note that says either “the text continues on the next page” or “the text ends here”. Example:
page 1: This is a very [the text continues on the next page]
page 2: long text that [the text continues on the next page]
page 3: does not fit [the text continues on the next page]
page 4: on one page. [the text ends here]
It should be obvious that the lower right corner of the page cannot be used for the normal text, since it is already used by the continuation marker.
A very similar technique is used by UTF-8 when converting a sequence of bytes into a sequence of code points. The rules are:
- If the first byte of the sequence is between 0 and 127, its value is the code point.
- If the first byte of the sequence is between 128 and 191, it is an error.
- If the first byte of the sequence is between 192 and 255, it starts a sequence of two to four bytes, and some bits of each of these bytes are combined to calculate the code point. The following bytes must be between 128 and 191. (Strictly speaking, the values 192, 193 and 245 to 255 never occur in valid UTF-8.)
This means that the highest bit of each byte works as the marker that says “this byte is part of a multi-byte code point sequence”. Because this bit has this meaning and cannot have any other meaning, only the code points from 0 to 127 can be represented using one byte. All other code points need more than one byte.
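To make the three rules concrete, here is a minimal decoder sketch in Python. The function name `decode_utf8` is made up for this illustration, and the code is deliberately simplified: it assumes that the number of leading 1 bits in the first byte gives the sequence length, and it does not reject overlong forms, surrogates, or values above U+10FFFF, which a real decoder must do.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of code points (simplified sketch)."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 127:
            # Rule 1: a byte from 0 to 127 is the code point itself.
            code_points.append(first)
            i += 1
            continue
        if first <= 191:
            # Rule 2: a byte from 128 to 191 cannot start a sequence.
            raise ValueError(f"unexpected continuation byte at offset {i}")
        # Rule 3: a lead byte. Its leading 1 bits give the sequence length,
        # its remaining low bits are the top bits of the code point.
        if first <= 223:      # 110xxxxx
            length, value = 2, first & 0x1F
        elif first <= 239:    # 1110xxxx
            length, value = 3, first & 0x0F
        elif first <= 247:    # 11110xxx
            length, value = 4, first & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in tail:
            if not 128 <= b <= 191:
                raise ValueError(f"expected continuation byte after offset {i}")
            # Each following byte (10xxxxxx) contributes six more bits.
            value = (value << 6) | (b & 0x3F)
        code_points.append(value)
        i += length
    return code_points


print([hex(cp) for cp in decode_utf8("aé€".encode("utf-8"))])
```

In practice you would simply call `bytes.decode("utf-8")`; the sketch only shows where the marker bits sit and why they cannot carry payload.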
UTF-8 is not the only possibility for storing Unicode code points in a sequence of bytes. You could also define an encoding with these rules:
- If the first byte is between 0 and 253, it represents its code point.
- If the first byte is 254, the two following bytes are used for code points in the range 254 to 65535.
- If the first byte is 255, the three following bytes are used for code points in the range U+010000 to U+10FFFF.
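With this encoding you would need only one byte for the code points from 0 to 253, but at least three bytes for all other code points. That is wasteful for Greek, Cyrillic and the many other scripts that UTF-8 stores in two bytes per character. (Most East Asian characters need three bytes in either encoding.)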
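The size difference can be checked with a small sketch of this hypothetical encoding. The name `encode_alt` is invented here, and storing the following bytes in big-endian order is an assumption, since the rules above do not specify a byte order.

```python
def encode_alt(text: str) -> bytes:
    """Encode text with the hypothetical 254/255-prefix scheme (sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 253:
            # One byte: the value is the code point.
            out.append(cp)
        elif cp <= 0xFFFF:
            # Marker 254, then two bytes (big-endian is an assumption).
            out.append(254)
            out += cp.to_bytes(2, "big")
        else:
            # Marker 255, then three bytes, enough for up to U+10FFFF.
            out.append(255)
            out += cp.to_bytes(3, "big")
    return bytes(out)


# Compare the encoded sizes for an ASCII, a Cyrillic and a Greek sample.
for sample in ("hello", "привет", "γειά"):
    print(sample, len(sample.encode("utf-8")), len(encode_alt(sample)))
```

For the Cyrillic and Greek samples the hypothetical encoding needs three bytes per letter where UTF-8 needs two, while the ASCII sample has the same size in both.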
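UTF-8 was carefully designed. It is compatible with ASCII, and because first bytes and following bytes use different value ranges, a decoder can find the start of the next code point from any position in the byte stream. It is worth reading some background information about it to appreciate the design.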