Can someone please confirm that all Kanji characters in Chinese are 3 bytes long in UTF-8?
2 Answers
The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF, and take 3 bytes in UTF-8. (The Japanese Hiragana and Katakana characters also take 3 bytes.)
However, there are also some very rarely-used characters in the "CJK Unified Ideographs Extension B" and "CJK Compatibility Ideographs Supplement" blocks. These lie outside the Basic Multilingual Plane (above U+FFFF) and therefore take 4 bytes in UTF-8.
Also be aware that Chinese text often contains ASCII characters like the digits 0-9, which take only 1 byte each in UTF-8.
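
If you want to check this yourself, here is a minimal Python sketch that measures the UTF-8 length of a few characters. The helper name `utf8_len` and the sample characters are my own picks for illustration, not anything standard; the characters are written as escapes so the snippet does not depend on the file's encoding.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Sample characters chosen for illustration (one per case discussed above).
samples = {
    "\u4e2d": "CJK Unified Ideographs (U+4E2D)",
    "\u3042": "Hiragana (U+3042)",
    "\u30ab": "Katakana (U+30AB)",
    "\U00020000": "CJK Unified Ideographs Extension B (U+20000)",
    "5": "ASCII digit (U+0035)",
}

for ch, description in samples.items():
    print(f"{description}: {utf8_len(ch)} byte(s)")

# Mixed text: four ASCII digits (1 byte each) plus one ideograph (3 bytes).
mixed = "2010\u5e74"
print(f"mixed string: {utf8_len(mixed)} bytes")
```

The practical upshot is that you cannot compute the byte length as 3 times the character count; encode the string and measure it instead.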

dan04
- Japanese text sourced from Shift-JIS is also likely to contain other non-Kanji, non-ASCII characters mapping to two-byte sequences. And then we'll shortly have the emoji to contend with, which are also outside the Basic Multilingual Plane and so 4 bytes... – bobince Sep 10 '10 at 11:28
- @sleske: No, I don't *speak* Chinese. I've just done way too much work with character encoding. – dan04 Sep 10 '10 at 13:17
- @sleske and also... this is the internet. SO has most likely people who speak languages you haven't even heard of. – Julian Aug 21 '12 at 17:22
- See also this question over on the Japanese stack exchange: http://japanese.stackexchange.com/q/6872/16273 -- apparently some of the "rarely-used" characters aren't all that rare. – benkc Jul 25 '16 at 21:48