On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings UTF-8 is an 8-bit encoding. So, what's the truth? If it's an 8-bit encoding, then what's the difference between ASCII and UTF-8? If it's not, then why is it called UTF-8, and why do we need UTF-16 and the others if they occupy the same memory?
4 Answers
Excerpt from Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!):
Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
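To see that lossiness in action, here is a minimal Python sketch (my own illustration, not part of the quoted article): encoding non-Latin text into a legacy single-byte code page discards the code points it cannot represent, while UTF-8 keeps them all.

```python
# Characters with no equivalent in cp1252 get replaced with '?'.
text = "Hello, Привет"
print(text.encode("cp1252", errors="replace"))   # b'Hello, ??????'

# UTF-8 keeps every code point intact (2 bytes per Cyrillic letter).
print(text.encode("utf-8"))                      # b'Hello, \xd0\x9f\xd1\x80...'
```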

- Note that the quoted sentence _Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes_ is no longer accurate. Since that piece was written, Unicode has established an upper bound such that UTF-8 only needs 1-4 bytes (and never 5 or 6 bytes). That does not greatly affect the main thrust of Joel's article. Most people don't have to deal with UTF-7. UCS-2 is obsolete, a relic from the days when Unicode was limited to 16-bit code points; UTF-16 handles the more modern, larger range (U+0000 .. U+10FFFF). UCS-4 is now a synonym for UTF-32. On the whole, use the UTF names and not UCS. – Jonathan Leffler Aug 12 '16 at 20:43
UTF-8 is an 8-bit, variable-width encoding. The first 128 characters of Unicode, when represented with the UTF-8 encoding, have exactly the same byte representation as the corresponding ASCII characters.
To understand this further, Unicode treats characters as code points - mere numbers that can be represented in multiple ways (the encodings). UTF-8 is one such encoding. It is the most commonly used, because it gives the best space-consumption characteristics among the Unicode encodings for typical text. If you store characters from the ASCII character set in the UTF-8 encoding, the UTF-8-encoded data takes the same amount of space. This allowed applications that previously used ASCII to move seamlessly (well, not quite, but it certainly didn't result in something like Y2K) to Unicode, since the character representations are the same.
I'll leave this extract from RFC 3629 here, showing how the UTF-8 encoding works:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You'll notice why the encoding will result in characters occupying anywhere between 1 and 4 bytes (the right-hand column) for different ranges of characters in Unicode (the left-hand column).
UTF-16, UTF-32, UCS-2, etc. employ different encoding schemes in which the code points are represented as 16-bit or 32-bit code units, instead of the 8-bit code units that UTF-8 uses.
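To make the table above concrete, here is a small Python sketch (mine, not part of the RFC) that takes one character from each row and shows it landing in 1, 2, 3 and 4 bytes:

```python
# One character from each row of the RFC 3629 table.
for ch in ("A",             # U+0041, ASCII range -> 1 byte
           "é",             # U+00E9              -> 2 bytes
           "€",             # U+20AC              -> 3 bytes
           "\U0001F600"):   # U+1F600, an emoji   -> 4 bytes
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
```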

The '8' in UTF-8 means that the individual bytes (code units) of the encoding use 8 bits. In contrast, pure ASCII is a 7-bit encoding, as it only has code points 0-127. It used to be that software had problems with 8-bit encodings; one of the reasons for the Base64 and uuencode encodings was to get binary data through email systems that could not handle 8-bit data. However, it has been a decade or more since that was an acceptable excuse - software has had to be 8-bit clean, i.e. capable of handling 8-bit encodings.
Unicode itself is a 21-bit character set. There are a number of encodings for it:
- UTF-32 where each Unicode code point is stored in a 32-bit integer
- UTF-16 where many Unicode code points are stored in a single 16-bit integer, but some need two 16-bit integers (so it needs 2 or 4 bytes per Unicode code point).
- UTF-8 where a single Unicode code point can require 1, 2, 3 or 4 bytes to store.
So, "UTF-8 can be represented by 1-4 bytes" is probably not the most appropriate way of phrasing it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.

- Then what the hell does "21-bit character set" mean? UTF-8 is 8, 16, 24 or 32 bits, UTF-16 is 16 or 32 bits, UTF-32 is 32 bits. I don't see 21 anywhere here. Sorry for being stupid. – Sergey Jun 14 '11 at 04:26
- @Sergey, Unicode has 1,114,112 codepoints as of the latest version. You would need 21 bits at a minimum to fully specify all codepoints. – Vineet Reynolds Jun 14 '11 at 04:42
- It means, Sergey, that the valid Unicode code points are all in the range U+0000 through U+10FFFF, and that U+10FFFF only requires 21 bits to represent. The range is also chosen so that the code points beyond the BMP can be encoded by two surrogates (a high surrogate and a low surrogate) in UTF-16. If the range were extended, that would no longer be possible. You will eventually learn to distinguish between the code points (U+wxyz values) and the various ways in which they can be encoded, such as UTF-8, UTF-16, and UTF-32. – Jonathan Leffler Jun 14 '11 at 04:43
- Vineet is correct, but @Sergey has a point. Characterizing Unicode as "a 21-bit character set" is potentially confusing in the context of this question. Unicode has 17 planes; each plane has 65,536 code points, giving a total of 1,114,112. To represent that total in binary (in base 2, rather than base 10), you need 21 digits (bits). That is, the binary number 111111111111111111111 (21 bits), when represented in decimal, is 2,097,151 (2 to the power 21, minus 1), which is greater than 1,114,112. 20 bits (a maximum of 1,048,575) is not quite enough. To that extent, Unicode is a "21-bit character set". – Graham Hannington Jul 07 '14 at 04:54
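A quick way to check the arithmetic in these comments (a small Python sketch of my own):

```python
# The highest code point is U+10FFFF; it needs 21 bits,
# and 20 bits (max 0xFFFFF = 1,048,575) would not be enough.
print(0x10FFFF)                 # 1114111, i.e. 1,114,112 code points counting 0
print((0x10FFFF).bit_length())  # 21
print(2**20 - 1, 2**21 - 1)     # 1048575, 2097151
```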
Just complementing the other answers about UTF-8 coding, which uses 1 to 4 bytes per code point.
As noted above, a 4-byte code totals 32 bits, but 11 of those 32 bits are used as prefixes in the control bytes, i.e. to identify how many bytes (1 to 4) a Unicode symbol occupies and to make it easy to resynchronize even when decoding starts in the middle of a text.
The key question is: why do we need so many bits (11) of control in a 32-bit code? Wouldn't it be more useful to have more than 21 bits available for encoding characters?
The point is that the scheme has to make it easy to find the way back to the first byte of a code.
Thus, bytes other than the first cannot have all of their bits free to encode a Unicode symbol, because otherwise they could easily be confused with the first byte of a valid UTF-8 code.
So the model is:
- 0UUUUUUU for a 1-byte code. We have 7 U bits, so there are 2^7 = 128 possibilities, which are the traditional ASCII codes.
- 110UUUUU 10UUUUUU for a 2-byte code. Here we have 11 U bits, so there are 2^11 - 2^7 = 2,048 - 128 = 1,920 possibilities. The 2^7 = 128 codes of the 1-byte form are subtracted because re-encoding them with 2 bytes would be redundant (an overlong encoding).
- 1110UUUU 10UUUUUU 10UUUUUU for a 3-byte code. Here we have 16 U bits, so there are 2^16 - 2^11 = 65,536 - 2,048 = 63,488 possibilities.
- 11110UUU 10UUUUUU 10UUUUUU 10UUUUUU for a 4-byte code. Here we have 21 U bits, so there are 2^21 - 2^16 = 2,097,152 - 65,536 = 2,031,616 possibilities.

where U is a bit (0 or 1) used to encode a Unicode symbol.
So in total there are 128 + 1,920 + 63,488 + 2,031,616 = 2,097,152 possible codes.
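The bit model above can be turned directly into code. Here is a minimal, illustrative Python encoder (my own sketch with a hypothetical helper name, not a production implementation; it skips surrogate and range checks) that fills the U bits for each of the four patterns:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the 1-4 byte patterns above."""
    if cp < 0x80:                       # 0UUUUUUU
        return bytes([cp])
    if cp < 0x800:                      # 110UUUUU 10UUUUUU
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110UUUU 10UUUUUU 10UUUUUU
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),  # 11110UUU 10UUUUUU 10UUUUUU 10UUUUUU
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])

# Matches Python's built-in encoder for a few sample code points.
for cp in (0x48, 0xE9, 0x20AC, 0x1F680):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```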
In the Unicode tables available (for example, in the Unicode Pad app for Android), the Unicode code appears in the form U+H, where H is a hex number of 1 to 6 digits. For example, U+1F680 represents a rocket icon: 🚀.
This hex number is formed from the U bits of the symbol's code (21 bits for a 4-byte code, 16 for 3 bytes, 11 for 2 bytes and 7 for 1 byte), grouped into hex digits from the right, with the incomplete group on the left padded with 0s.
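Taking the rocket as a worked example (a small Python check of my own):

```python
ch = "\U0001F680"                   # U+1F680, the rocket
print(f"{ord(ch):021b}")            # the 21 U bits: 000011111011010000000
print(ch.encode("utf-8").hex(" "))  # f0 9f 9a 80
# Split 3+6+6+6, the U bits are 000 / 011111 / 011010 / 000000,
# which slot into 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
```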
Below we will try to explain why 11 control bits are needed. Some of the choices made were merely arbitrary picks between 0 and 1, with no deeper rationale.
Since 0 is used to indicate a one-byte code, 0....... is always equivalent to the 128 ASCII characters (backwards compatibility).
For symbols that use more than 1 byte, the 10 at the start of the 2nd, 3rd and 4th bytes always tells us that we are in the middle of a code.
To avoid confusion: if the first byte starts with 11, it indicates that this 1st byte begins a Unicode character with a 2-, 3- or 4-byte code. On the other hand, 10 marks a middle (continuation) byte, that is, one that never starts the encoding of a Unicode symbol. (Obviously the prefix for continuation bytes could not simply be 1, because 0... and 1... would exhaust all possible bytes.)
If there were no rule for non-initial bytes, decoding would be very ambiguous.
With this choice, we know that an initial byte starts with 0 or 11, so it can never be confused with a middle byte, which starts with 10. Just looking at a single byte, we already know whether it is an ASCII character, the beginning of a byte sequence (2, 3 or 4 bytes) or a byte from the middle of a byte sequence.
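That property is easy to express in code. Here is a tiny Python sketch of my own (the helper name is hypothetical) that classifies any byte by its prefix alone:

```python
def classify(byte: int) -> str:
    """Tell what role a byte plays in a UTF-8 stream from its prefix."""
    if byte & 0b10000000 == 0:           # 0xxxxxxx
        return "ASCII / 1-byte code"
    if byte & 0b11000000 == 0b10000000:  # 10xxxxxx
        return "continuation (middle) byte"
    return "lead byte of a 2-, 3- or 4-byte sequence"  # 11xxxxxx

for b in "Hé€".encode("utf-8"):
    print(f"{b:02X}: {classify(b)}")
```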
It could have been the opposite choice: the prefix 11 could indicate the middle byte and the prefix 10 the start byte of a 2-, 3- or 4-byte code. That choice is just a matter of convention.
Also as a matter of convention, a 3rd bit of 0 in the 1st byte means a 2-byte UTF-8 code, and a 3rd bit of 1 means a 3- or 4-byte code (again, it is impossible to adopt the prefix 11 for 2-byte symbols; that too would exhaust all possible bytes: 0..., 10... and 11...).
So a 4th bit is required in the 1st byte to distinguish 3-byte from 4-byte UTF-8 codes: a 4th bit of 0 means a 3-byte code and a 4th bit of 1 means a 4-byte code, which then also uses an additional 0 bit that at first looks needless.
One of the reasons, beyond the pretty symmetry (0 is always the last prefix bit in a starting byte), for having that additional 0 as the 5th bit of the first byte of a 4-byte Unicode symbol is to make an unknown string almost unmistakably recognizable as UTF-8, because no byte in the range from 11111000 to 11111111 (F8 to FF, or 248 to 255) can ever occur.
If, hypothetically, we used 22 bits (taking the last 0 of the 5 prefix bits in the first byte as part of the character code for 4-byte symbols), there would be 2^22 = 4,194,304 possibilities in total (22 because there would be 4 + 6 + 6 + 6 = 22 bits left to encode the symbol and 4 + 2 + 2 + 2 = 10 bits as prefix).
With the adopted UTF-8 coding system (the 5th bit fixed at 0 for 4-byte codes), there are 2^21 = 2,097,152 possibilities (21 because there are 3 + 6 + 6 + 6 = 21 bits left to encode the symbol and 5 + 2 + 2 + 2 = 11 bits as prefix), but only 1,112,064 of these are valid Unicode symbols: code points run only from U+0000 to U+10FFFF (1,114,112 values), and the 2,048 surrogates reserved for UTF-16 are excluded.
As we have seen, not all of the 2,097,152 possibilities offered by 21 bits are used. Far from it: just 1,112,064. So reclaiming that one extra bit would not bring any tangible benefit.
Another reason is the possibility of using these unused codes for control functions outside the Unicode world.
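The counting above can be double-checked with a few lines of Python (my own sketch):

```python
# Code points representable by the 1-, 2-, 3- and 4-byte UTF-8 forms
# (excluding overlong encodings), and the valid Unicode total.
counts = [2**7, 2**11 - 2**7, 2**16 - 2**11, 2**21 - 2**16]
print(counts, sum(counts))   # [128, 1920, 63488, 2031616] 2097152
print(0x110000 - 0x800)      # 1112064 valid scalar values
# (0x110000 code points U+0000..U+10FFFF minus 0x800 surrogates)
```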
