22

Can someone please confirm that all Kanji characters in Chinese are 3 bytes long in UTF-8?

Shepmaster
TopCoder

2 Answers

32

The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF, and take 3 bytes in UTF-8. (The Japanese Hiragana and Katakana characters also take 3 bytes.)

However, there are also some rarely used characters in the "CJK Unified Ideographs Extension B" (U+20000 to U+2A6DF) and "CJK Compatibility Ideographs Supplement" (U+2F800 to U+2FA1F) blocks. These lie outside the Basic Multilingual Plane and take 4 bytes in UTF-8.

Also be aware that Chinese text often contains ASCII characters like the digits 0-9, which take only 1 byte each.
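
The byte counts described above can be checked directly by encoding a few sample characters. This is a minimal sketch, not part of the original answer: the helper name `utf8_len` and the choice of sample characters are illustrative assumptions.

```python
# Illustrative samples: one character from each range discussed above.
samples = {
    "7": "ASCII digit",
    "中": "CJK Unified Ideographs",
    "あ": "Hiragana",
    "カ": "Katakana",
    "\U00020000": "CJK Unified Ideographs Extension B",
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, block in samples.items():
    print(f"U+{ord(ch):04X} ({block}): {utf8_len(ch)} byte(s)")
```

Because a string can mix all of these, the byte length of real text has to be measured with `len(text.encode("utf-8"))` rather than assumed to be three times the character count.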

dan04
  • 2
    Japanese text sourced from Shift-JIS is also likely to contain other non-Kanji, non-ASCII characters mapping to two-byte sequences. And then we'll shortly have the emoji to contend with, which are also outside the Basic Multilingual Plane and so 4 bytes... – bobince Sep 10 '10 at 11:28
  • 2
    @sleske: No, I don't *speak* Chinese. I've just done way too much work with character encoding. – dan04 Sep 10 '10 at 13:17
  • 2
    @sleske and also... this is the internet. SO has most likely people who speak languages you haven't even heard of. – Julian Aug 21 '12 at 17:22
  • 2
    See also this question over on the Japanese stack exchange: http://japanese.stackexchange.com/q/6872/16273 -- apparently some of the "rarely-used" characters aren't all that rare. – benkc Jul 25 '16 at 21:48
3

Yes. Kanji occupy U+4E00 to U+9FAF, and UTF-8 encodes every code point from U+0800 to U+FFFF in 3 bytes.
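
The range argument above can be sketched as a small function that maps a code point to its UTF-8 length. The function name `utf8_length` is a hypothetical helper introduced for illustration, and the thresholds are the standard UTF-8 boundaries, not something specific to this answer.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for the given Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    if code_point <= 0x7F:      # U+0000 to U+007F (ASCII)
        return 1
    if code_point <= 0x7FF:     # U+0080 to U+07FF
        return 2
    if code_point <= 0xFFFF:    # U+0800 to U+FFFF, includes U+4E00 to U+9FFF
        return 3
    return 4                    # U+10000 to U+10FFFF (outside the BMP)


# Every code point in the CJK Unified Ideographs block falls in the 3-byte range.
all_three = all(utf8_length(cp) == 3 for cp in range(0x4E00, 0xA000))
print(all_three)
```

This only confirms the claim for the main block. Ideographs at U+20000 and above still need 4 bytes.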

gawi