UTF-8 Continuation bytes

Question

I'm trying to figure out what "continuation bytes" are (for curiosity's sake) in the UTF-8 encoding.

Wikipedia introduces this term in the UTF-8 article without defining it at all.

A Google search returns no useful information either. I'm about to jump into the official specification, but would prefer to read a high-level summary first.

Looks like somebody just edited the Wikipedia article. (: – tripleee Feb 20 '12 at 13:08

paxdiablo · Accepted Answer · 2020-08-19T02:55:17.720

56

A continuation byte in UTF-8 is any byte where the top two bits are 10.

They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
 U+000000-U+00007f   0xxxxxxx  0xxxxxxx

 U+000080-U+0007ff   110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

 U+000800-U+00ffff   1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

 U+010000-U+10ffff   11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx

Here you can see how the Unicode code points map to UTF-8 multi-byte byte sequences, and their equivalent binary values.
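To make the table concrete, here is a minimal Python sketch of an encoder that follows it row by row. The function name `encode_code_point` is made up for illustration, and the sketch does not reject surrogates (U+D800 to U+DFFF), which a strict encoder would refuse.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point as UTF-8, one branch per table row (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        # 0xxxxxxx: the value is the byte itself
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        # Assumption: surrogates are NOT filtered out here.
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For example, `encode_code_point(0x20AC)` (the euro sign) gives `b"\xe2\x82\xac"`: one leading byte starting with `1110` and two continuation bytes starting with `10`.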

The basic rules are these:

If a byte starts with a 0 bit, it's a single byte value less than 128.
If it starts with 11, it's the first byte of a multi-byte sequence and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
If it starts with 10, it's a continuation byte.
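The three rules above can be sketched as a small classifier. The name `classify_byte` and its return labels are hypothetical. The sketch looks only at the leading bits, so it does not flag bytes such as 0xC0, 0xC1 or 0xF5 to 0xF7, which match a leading-byte pattern but never occur in valid UTF-8.

```python
def classify_byte(b: int) -> str:
    """Classify a single byte (0-255) by its leading bits (illustrative only)."""
    if (b & 0x80) == 0x00:
        return "single"        # 0xxxxxxx
    if (b & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx
    if (b & 0xE0) == 0xC0:
        return "lead-2"        # 110xxxxx: two bytes in total
    if (b & 0xF0) == 0xE0:
        return "lead-3"        # 1110xxxx: three bytes in total
    if (b & 0xF8) == 0xF0:
        return "lead-4"        # 11110xxx: four bytes in total
    return "invalid"           # 11111xxx never appears in UTF-8
```

Running it over `"é".encode("utf-8")` (the bytes `C3 A9`) yields `"lead-2"` followed by `"continuation"`.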

This distinction allows quite handy processing such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find one not beginning with the 10 bits.
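As a rough sketch of that backward search, assuming the input is well-formed UTF-8 held in a `bytes` object (the helper name `code_point_start` is made up here):

```python
def code_point_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the code point that contains data[i]."""
    # Step back while the current byte looks like 10xxxxxx (a continuation byte).
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i
```

With `data = "a€b".encode("utf-8")` (bytes `61 E2 82 AC 62`), indices 1, 2 and 3 all map back to index 1, where the leading byte `E2` sits.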

Similarly, it can also be used for a UTF-8 strlen by only counting non-10xxxxxx bytes.
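A minimal sketch of such a length function, again assuming well-formed input (the name `utf8_length` is hypothetical). It counts code points, which is not always the same as the number of characters a user sees, for instance when combining marks are involved.

```python
def utf8_length(data: bytes) -> int:
    """Count code points by counting every byte that is not 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)
```

For well-formed input this agrees with Python's own `len()` on the decoded string, for example `utf8_length("héllo".encode("utf-8"))` is 5 even though the encoded form is 6 bytes long.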

edited Aug 19 '20 at 02:55

answered Feb 20 '12 at 04:30

paxdiablo


I think this is a bit of a stretch/overbroad... anyway, it contains useful info! What I can't understand, though, is why `11` is ever needed. One could say that the leading byte starts with `0`, and the continuation ones (there can be arbitrarily many) start with `1`. – EKons Aug 27 '16 at 18:37
@ΈρικΚωνσταντόπουλος "_the leading byte starts with `0`_" -- this is not correct. A byte that starts with `0` is a single-byte code point, so it is neither a _leading_ byte nor a continuation. It stands alone. That's what makes it distinct from bytes starting with `11`, which indicate it is the first byte of a _sequence_ and more bytes are expected to follow in order to represent a single code point. – William Price Mar 21 '17 at 20:45
@WilliamPrice Dunno why I posted that off-topic comment, but I think it was me trying to invent my own encoding. – EKons Mar 22 '17 at 11:40
You say that 110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four. Shouldn't it be the other way? 010 for 2, 011 for 3, 100 for 4? And if not, why not in binary code? – Cornelius Aug 19 '20 at 02:44
@Cornelius: no, the bit patterns you've given are the binary *values* for 2, 3, and 4, but that has nothing to do with UTF-8 encoding. If the first byte starts with `110`, there is one continuation byte. If it starts with `1110`, there are two continuation bytes. Ditto for `11110` having three continuation bytes. It may help to think of the *first* bit of the first byte deciding whether it's a multi-byte (if `1`) or not (if `0`). Then the number of consecutive `1` bits *after* that is the number of continuation bytes that follow (1-3). – paxdiablo Aug 19 '20 at 02:52
Ok, so the standard is not using binary, but why? Why don't we have first three bits in binary deciding how many bytes the character consumes, and no leading bits anywhere? They obviously avoided using binary so that starting byte of a 4 or 5 byte character wouldn't get confused with a leading byte. That means leading bits are important. Why is it important to know something is a leading byte, just by looking at that one single byte? – Cornelius Aug 19 '20 at 13:07
@Cornelius, why they did it that way I don't know, but I know what you propose wouldn't work. It was important to be backward compatible with 7-bit ASCII so they could only use bytes with the top bit set. That means no prefix less than four (`100`). They *could* possibly have used a different scheme but the one they went with is the one we have, plain and simple :-) – paxdiablo Aug 19 '20 at 15:27
Oooh, I understand now, but, only when I think the opposite of what you said :) It is more a backward incompatibility or unambiguity. If you encounter a line of bytes that start with 1, it's a Unicode. If it's a zero (in UTF-8), it's actually ASCII. – Cornelius Aug 20 '20 at 12:27
Cornelius: sort of. Just wanted to make clear that a character is not *either* Unicode/ASCII, since ASCII is a proper subset. If you have a file consisting of UTF-8 code points all under `0x80`, this can be processed in exactly the same way by an ASCII-only program or a UTF-8-aware one. That's the backwards compatibility. – paxdiablo Aug 20 '20 at 21:04
@EKons One argument in favour of the existing UTF-8 over your scheme is that, if some error occurs and some bytes get corrupted, the unaffected characters still keep their correct meaning. Suppose I send `0xxxxxxx` `110xxxxx` `10xxxxxx` (a 1-byte char, then a 2-byte char), and the second byte gets modified to become `100xxxxx`; the decoder can still print the first character properly. But in your variant, .... – Sourav Kannantha B Jan 17 '22 at 18:21
@EKons .... if I send `0xxxxxxx` `0xxxxxxx` `1xxxxxxx` (a 1-byte char, then a 2-byte char), and the second byte gets corrupted to `1xxxxxxx`, then voilà, the parser will now falsely parse it as an entirely new 3-byte character, so in this case you lose 2 characters to a single-byte error. – Sourav Kannantha B Jan 17 '22 at 18:21

score 1 · Answer 2 · answered Feb 20 '12 at 04:31


In short, continuation bytes are all the bytes other than a first byte or a single byte. In UTF-8, continuation bytes start with 0x10.

answered Feb 20 '12 at 04:31

rogerz


I think you mean to say 0b10000000, which is 0x80 (the relevant part being the first two bits). They use `x` differently in the specification. See https://tools.ietf.org/html/rfc3629#section-3 – tay10r Nov 03 '20 at 19:21

score -4 · Answer 3 · answered Feb 20 '12 at 05:08

“Continuation byte” isn’t a term but a normal English word and the term “byte.” If used as a pseudo-term, it may confuse the reader.

The Unicode Standard uses this expression in one place only, Ch. 5, clause 5.22: “For example, consider the first three bytes of a four-byte UTF-8 sequence, followed by a byte which cannot be a valid continuation byte: .” In this context, the meaning is clear: it’s just a byte that continues something, namely a sequence of bytes.

The Wikipedia page apparently uses “continuation byte” to mean any byte in the UTF-8 encoding except the first byte of the encoded form of a character.
