UTF-8: How can a reader know how many bytes a character occupies?

Asked Aug 02 '17 at 15:41

Active Aug 04 '17 at 08:06

UTF-8 represents each character with one or more bytes. Suppose I have the following byte sequence:

48 65

How can I tell whether this is one character represented by 48 followed by another character represented by 65, or ONE character represented by the TWO-byte combination 48 65?

edited Aug 03 '17 at 23:19

Remy Lebeau

asked Aug 02 '17 at 15:41

CrazySynthax

Possible duplicate of [Detect UTF-8 encoding (How does MS IDE do it)?](https://stackoverflow.com/questions/11479143/detect-utf-8-encoding-how-does-ms-ide-do-it) – Aug 02 '17 at 15:43
Because the [most significant bits in the first byte of a codepoint tell a UTF-8 decoder how many bytes make up a codepoint](https://en.wikipedia.org/wiki/UTF-8). – Phylogenesis Aug 02 '17 at 15:43
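To make the comment above concrete, here is a minimal Python sketch of how a decoder might read the leading bits of the first byte to find the sequence length. The function names `utf8_sequence_length` and `split_code_points` are hypothetical, and the sketch assumes the bytes in the question are hexadecimal. It is deliberately simplified: it does not check that continuation bytes have the form `10xxxxxx`, nor does it reject overlong encodings, both of which a real decoder must do.

```python
from typing import List


def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with lead_byte spans."""
    if (lead_byte & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx: single byte (ASCII range)
        return 1
    if (lead_byte & 0b1110_0000) == 0b1100_0000:  # 110xxxxx: two-byte sequence
        return 2
    if (lead_byte & 0b1111_0000) == 0b1110_0000:  # 1110xxxx: three-byte sequence
        return 3
    if (lead_byte & 0b1111_1000) == 0b1111_0000:  # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never appears in UTF-8.
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


def split_code_points(data: bytes) -> List[bytes]:
    """Split data into one chunk per code point, trusting each lead byte."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# The bytes from the question, read as hexadecimal (an assumption).
question_bytes = bytes.fromhex("48 65")
print(split_code_points(question_bytes))
```

Under that hexadecimal assumption, both 0x48 and 0x65 have a top bit of 0, so each one is a complete single-byte code point and the pair cannot be one two-byte character.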
Also, you should be careful with your terminology when it comes to Unicode. What you're talking about here is individual 'code points'. What you probably consider to be a character (or [grapheme cluster](http://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html#unicode.introduction_to_unicode.grapheme_clusters)) can be made up of an arbitrary number of individual code points. For instance, the character `é` can be encoded as `U+00E9` (LATIN SMALL LETTER E WITH ACUTE), or as `U+0065` (LATIN SMALL LETTER E) followed by `U+0301` (COMBINING ACUTE ACCENT). – Phylogenesis Aug 03 '17 at 08:12
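The two spellings of `é` mentioned in this comment can be compared with a short Python sketch using the standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT, two code points

# Both render as the same visible character but differ in code point count.
print(len(precomposed), len(decomposed))

# Their UTF-8 byte sequences differ as well.
print(precomposed.encode("utf-8").hex(" "))
print(decomposed.encode("utf-8").hex(" "))

# Normalization converts between the two forms.
composed_again = unicodedata.normalize("NFC", decomposed)
decomposed_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed, decomposed_again == decomposed)
```

So counting bytes per code point is only part of the story: text that looks identical on screen may contain a different number of code points, and therefore a different number of bytes, unless it has been normalized first.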
