I need help understanding how "every code point from 0-127 is stored in a single byte" as quoted from below.

Here is the context:

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Source: http://www.joelonsoftware.com/articles/Unicode.html

I understand that the numbers 0-127 are what they use to represent ASCII characters with. Unicode characters are represented by U+HexHex, aka a code point. How does 0-127 translate to a code point? If each hex number can represent 16 options, then one 8 bit byte can fit 2 hex numbers (2^8=16*16=256).

Question: But then there are 256 characters that can be represented, so why stop at 127? I can see why you need 2 bytes for characters above 256 code points, but why do you need 2 bytes for code points 128-256?

imagineerThat
  • Short answer: the first bit defines whether the UTF-8 character is multi-byte. [See this chart on Wikipedia](https://en.wikipedia.org/wiki/UTF-8#Description). – Alexander O'Mara Sep 27 '14 at 02:50
  • If you used all the possible values 0-255 for the Unicode code points 0-255, how would you use code points higher than that? How could you tell the difference between a 16-bit code point and two 8-bit code points? I've always thought UTF-8 was quite clever. – Mark Ransom Sep 27 '14 at 02:57
  • @Mark: Well, it says code points 128 and above are stored using 2 or more bytes. – imagineerThat Sep 27 '14 at 02:58
  • @imagineerThat That's true, they are. They *must* be in order to allow code points greater than 255. P.S. The actual limit was changed after Joel wrote that article: the maximum code point is now 0x10FFFF, which only requires 4 bytes in UTF-8. – Mark Ransom Sep 27 '14 at 03:06
  • Possible duplicate of [How does UTF-8 "variable-width encoding" work?](http://stackoverflow.com/questions/1543613/how-does-utf-8-variable-width-encoding-work) – Joe Sep 27 '14 at 11:12

1 Answer

As an analogy, suppose you want to write a long text onto multiple pages, and you need to know exactly when the text ends. Then you would probably reserve a small part of the lower right corner for a note that says either “the text continues on the next page” or “the text ends here”. Example:

page 1: This is a very [the text continues on the next page]
page 2: long text that [the text continues on the next page]
page 3: does not fit   [the text continues on the next page]
page 4: on one page.   [the text ends here]

It should be obvious that the lower right corner of the page cannot be used for the normal text, since it is already used by the continuation marker.

A very similar technique is used by UTF-8 when converting a sequence of bytes into a sequence of code points. The rules are:

  • If the first byte of the sequence is between 0 and 127, its value is the code point.
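  • If the first byte of the sequence is between 128 and 191, it is an error, because these values are only valid as the following (continuation) bytes of a multi-byte sequence.
  • If the first byte of the sequence is between 192 and 255, it belongs to a sequence of several bytes, and some bits of these bytes are used to calculate the code point: 192 to 223 starts a 2-byte sequence, 224 to 239 a 3-byte sequence, and 240 to 247 a 4-byte sequence. The following bytes must be between 128 and 191. (Modern UTF-8, as restricted by RFC 3629, never uses the bytes 192, 193, or 245 to 255.)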
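
The rules above can be sketched as a small decoder. This is only an illustration: the function name `decode_first` is made up for this example, and the sketch deliberately skips the stricter checks a real decoder performs (overlong encodings, surrogates, code points above U+10FFFF).

```python
def decode_first(data):
    """Decode the first code point in `data` (bytes).

    Returns (code_point, number_of_bytes_used).
    Hypothetical helper for illustration, not a full validator.
    """
    first = data[0]
    if first <= 127:
        # Single byte: the byte value is the code point.
        return first, 1
    if first <= 191:
        # 128..191 may only appear as a following byte.
        raise ValueError("continuation byte cannot start a sequence")
    if first <= 223:
        length, value = 2, first & 0x1F   # 110xxxxx
    elif first <= 239:
        length, value = 3, first & 0x0F   # 1110xxxx
    elif first <= 247:
        length, value = 4, first & 0x07   # 11110xxx
    else:
        raise ValueError("not a valid UTF-8 lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if not 128 <= b <= 191:
            raise ValueError("expected a byte between 128 and 191")
        # Each following byte contributes its low 6 bits.
        value = (value << 6) | (b & 0x3F)
    return value, length


print(decode_first(b"A"))
print(decode_first("\u00e9".encode("utf-8")))
```

In practice you would simply call `bytes.decode("utf-8")`; the sketch only makes the role of each byte range visible.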

This means that the highest bit of each byte works as the marker that says “this byte is part of a multi-byte code point sequence”. Because this bit has this meaning and cannot have any other meaning, only the code points from 0 to 127 can be represented using one byte. All other code points need more than one byte.
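
To see the marker bit directly, here is a short sketch using Python's built-in `str.encode`. The helper name `utf8_bits` is just for this example. It prints each byte in binary, so you can check that the highest bit is 0 only for code points 0 to 127, and that code point 128 already needs two bytes.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as space-separated binary."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


# U+007F is the last single-byte code point; U+0080 is the first 2-byte one.
for ch in ["A", "\u007f", "\u0080", "\u00e9", "\u20ac", "\U0001F600"]:
    n = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {n} byte(s): {utf8_bits(ch)}")
```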


UTF-8 is not the only possibility for storing Unicode code points in a sequence of bytes. You could also define an encoding with these rules:

  • If the first byte is between 0 and 253, it represents its code point.
  • If the first byte is 254, the two following bytes are used for code points in the range 254 to 65535.
  • If the first byte is 255, the three following bytes are used for code points in the range U+010000 to U+10FFFF.

Now you would only need one byte for the code points from 0 to 253, but at least three bytes for all other code points, which is wasteful for Greek, Cyrillic, East Asian and many other languages.
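
As a rough comparison, the sketch below counts bytes for the hypothetical encoding described above and for real UTF-8. The function names `naive_length` and `utf8_length` are invented for this example, and the hypothetical encoding exists only in this answer.

```python
def naive_length(cp):
    """Bytes needed for one code point in the hypothetical encoding."""
    if cp <= 253:
        return 1          # the byte is the code point
    if cp <= 0xFFFF:
        return 3          # marker byte 254 + two bytes
    return 4              # marker byte 255 + three bytes


def utf8_length(cp):
    """Bytes needed for one code point in real UTF-8 (non-surrogate input)."""
    return len(chr(cp).encode("utf-8"))


for text in ["hello", "Привет", "Ελληνικά"]:
    naive = sum(naive_length(ord(c)) for c in text)
    utf8 = sum(utf8_length(ord(c)) for c in text)
    print(f"{text}: hypothetical={naive} bytes, UTF-8={utf8} bytes")
```

For Cyrillic and Greek letters the hypothetical scheme needs three bytes where UTF-8 needs two. For characters that UTF-8 itself stores in three bytes, the two schemes tie.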

UTF-8 was carefully designed and is really great. Try to find some background information about it to understand all its beauty.

Roland Illig