Consider the Chinese character "語"; its UTF-8 encoding is:

11101000 10101010 10011110

While its UTF-16 encoding is shorter:

10001010 10011110

I'd love to understand why the UTF-8 encoding is bigger here. I've done some research, but let me break down what I still want to understand:

  • How is the highest bit used to say how many bytes a character needs?
  • What is the code point of 語?
  • I'd appreciate it if you could encode the above character in both UTF-8 and UTF-16 in simple terms, so I understand why UTF-16 is smaller here (see the sketch after this list).
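
For reference, a minimal Python 3 sketch that reproduces the bytes above; utf-16-be is the BOM-less big-endian form, chosen so only the two payload bytes are printed:

    ch = "語"  # code point U+8A9E
    print(f"code point: U+{ord(ch):04X}")

    for name in ("utf-8", "utf-16-be"):
        data = ch.encode(name)  # bytes of the chosen encoding
        bits = " ".join(f"{b:08b}" for b in data)
        print(f"{name}: {len(data)} bytes -> {bits}")

    # prints:
    # code point: U+8A9E
    # utf-8: 3 bytes -> 11101000 10101010 10011110
    # utf-16-be: 2 bytes -> 10001010 10011110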
  • It's not _bigger_, it's _different_; see https://en.wikipedia.org/wiki/UTF-8#Encoding. And the code point for _語_ is _U+8A9E_; see https://www.unicode.org/Public/UNIDATA/UnicodeData.txt – JosefZ Apr 05 '21 at 10:28
  • This is a well-known "problem" (from the beginning of UTF-8): UTF-8 can be longer than UTF-16. UTF-8 is just handy because it is compatible with ASCII, and many programs can handle UTF-8 transparently (like other encodings). Note: UTF-8 is not made to be as small as possible (it has a lot of redundancy, so it is easier to detect errors and to realign on random accesses). – Giacomo Catenazzi Apr 06 '21 at 09:20
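
To illustrate what these comments describe, here is a small Python 3 sketch of the UTF-8 lead-byte rule (an illustration only, not a strict validating decoder; the helper names utf8_length and is_continuation are mine):

    def utf8_length(lead: int) -> int:
        """Total byte length of a UTF-8 sequence, judged from its lead byte."""
        if lead < 0x80:
            return 1   # 0xxxxxxx: plain ASCII, one byte
        if lead >= 0xF0:
            return 4   # 11110xxx: four bytes
        if lead >= 0xE0:
            return 3   # 1110xxxx: three bytes (語's first byte 0xE8 lands here)
        if lead >= 0xC0:
            return 2   # 110xxxxx: two bytes
        raise ValueError("10xxxxxx is a continuation byte, not a lead byte")

    def is_continuation(b: int) -> bool:
        # 10xxxxxx bytes never start a character, so a decoder dropped into
        # the middle of a stream can skip them to realign (the redundancy
        # the second comment mentions)
        return (b & 0xC0) == 0x80

    print(utf8_length(0xE8))      # 3
    print(is_continuation(0xAA))  # True: the second byte of 語's UTF-8 form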

0 Answers