Consider the Chinese character "語"; its UTF-8 encoding is:

11101000 10101010 10011110

While its UTF-16 encoding is shorter:

10001010 10011110

I'd love to understand why the UTF-8 encoding is bigger here. I've done some research, but let me break down what I still want to understand:

  • How is the highest bit used to say how many bytes a character needs?
  • What is the code point of 語?
  • I'd appreciate it if you could encode the above character in both UTF-8 and UTF-16 in simple terms, so I understand why UTF-16 is smaller here (see the sketch after this list).
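
For reference, a minimal Python 3 sketch that reproduces the bytes above; utf-16-be is the BOM-less big-endian form, chosen so only the two payload bytes are printed:

    ch = "語"  # code point U+8A9E
    print(f"code point: U+{ord(ch):04X}")

    for name in ("utf-8", "utf-16-be"):
        data = ch.encode(name)  # bytes of the chosen encoding
        bits = " ".join(f"{b:08b}" for b in data)
        print(f"{name}: {len(data)} bytes -> {bits}")

    # prints:
    # code point: U+8A9E
    # utf-8: 3 bytes -> 11101000 10101010 10011110
    # utf-16-be: 2 bytes -> 10001010 10011110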
  • It's not _bigger_, it's _different_; see https://en.wikipedia.org/wiki/UTF-8#Encoding. And the code point for _語_ is _U+8A9E_; see https://www.unicode.org/Public/UNIDATA/UnicodeData.txt – JosefZ Apr 05 '21 at 10:28
  • This is a well-known "problem" (from the beginning of UTF-8): UTF-8 can be longer than UTF-16. UTF-8 is just handy because it is compatible with ASCII, and many programs can handle UTF-8 transparently (like other encodings). Note: UTF-8 is not made to be as small as possible (it has a lot of redundancy, so it is easier to detect errors and to realign on random accesses). – Giacomo Catenazzi Apr 06 '21 at 09:20
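
To illustrate what these comments describe, here is a small Python 3 sketch of the UTF-8 lead-byte rule (an illustration only, not a strict validating decoder; the helper names utf8_length and is_continuation are mine):

    def utf8_length(lead: int) -> int:
        """Total byte length of a UTF-8 sequence, judged from its lead byte."""
        if lead < 0x80:
            return 1   # 0xxxxxxx: plain ASCII, one byte
        if lead >= 0xF0:
            return 4   # 11110xxx: four bytes
        if lead >= 0xE0:
            return 3   # 1110xxxx: three bytes (語's first byte 0xE8 lands here)
        if lead >= 0xC0:
            return 2   # 110xxxxx: two bytes
        raise ValueError("10xxxxxx is a continuation byte, not a lead byte")

    def is_continuation(b: int) -> bool:
        # 10xxxxxx bytes never start a character, so a decoder dropped into
        # the middle of a stream can skip them to realign (the redundancy
        # the second comment mentions)
        return (b & 0xC0) == 0x80

    print(utf8_length(0xE8))      # 3
    print(is_continuation(0xAA))  # True: the second byte of 語's UTF-8 form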

0 Answers