UTF-8 encodes each character as one or more bytes. Suppose I have the following byte sequence:

48 65

How can I tell whether this is one character represented by 48 followed by another character represented by 65, or a single character represented by the two-byte combination 48 65?

Remy Lebeau
CrazySynthax
  • Possible duplicate of [Detect UTF-8 encoding (How does MS IDE do it)?](https://stackoverflow.com/questions/11479143/detect-utf-8-encoding-how-does-ms-ide-do-it) –  Aug 02 '17 at 15:43
  • Because the [most significant bits in the first byte of a code point tell a UTF-8 decoder how many bytes make up that code point](https://en.wikipedia.org/wiki/UTF-8). – Phylogenesis Aug 02 '17 at 15:43
  • Also, you should be careful with your terminology when it comes to Unicode. What you're talking about here is individual 'code points'. What you probably consider to be a character (or [grapheme cluster](http://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html#unicode.introduction_to_unicode.grapheme_clusters)) can be made up of an arbitrary number of individual code points. For instance, the character `é` can be encoded as `U+00E9` (LATIN SMALL LETTER E WITH ACUTE), or as `U+0065` (LATIN SMALL LETTER E) followed by `U+0301` (COMBINING ACUTE ACCENT). – Phylogenesis Aug 03 '17 at 08:12
