32

I was looking at the encoding of Chinese characters on Wikipedia and I'm having trouble figuring out what they are using. For instance, "的" is encoded as "%E7%9A%84" (see here). That's three bytes; however, none of the encodings described on this page uses three bytes to represent Chinese characters. UTF-8, for instance, uses 2 bytes.

I'm basically trying to match these three bytes to an actual character. Any suggestion on what encoding it could be?
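
For reference, a minimal sketch (assuming Python 3) of the round trip described above; unquote percent-decodes the escape sequence back into the character:

from urllib.parse import unquote

encoded = "%E7%9A%84"
decoded = unquote(encoded)       # percent-decodes using UTF-8 by default -> '的'
print(decoded)                   # 的
print(hex(ord(decoded)))         # 0x7684, the character's Unicode code point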

laurent

3 Answers

31

>>> c='\xe7\x9a\x84'.decode('utf8')
>>> c
u'\u7684'
>>> print c
的


Though the Unicode code point (U+7684) fits in 16 bits, UTF-8 breaks it down into 3 bytes.
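
The session above is Python 2; roughly the same round trip in Python 3 looks like this (illustrative sketch):

b = b'\xe7\x9a\x84'       # the three bytes from the URL
c = b.decode('utf-8')     # decode as UTF-8 -> '的'
print(c, hex(ord(c)))     # 的 0x7684
print(len(b), len(c))     # 3 UTF-8 bytes, 1 character
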
jcomeau_ictx
  • Thanks, I assumed UTF-8 was using the same encoding as Unicode. That makes sense now. – laurent Apr 10 '11 at 06:01
  • @Laurent: No, because (please repeat after me) *Unicode is not an encoding*. Unicode is a standard for representing text, and an encoding (actually, several encodings) is part of the standard. – sleske Jun 09 '11 at 08:39
  • @Laurent: You may be confused by the fact that in UTF-32 (which is one encoding) characters are in fact encoded by their codepoint number (i.e. the encoding is trivial). But there are other encodings, and UTF-32 is actually not used very often. – sleske Jun 09 '11 at 08:56
20

The header of a Wikipedia page includes this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

So the page is UTF-8.
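
One way to confirm the charset a page declares, as a minimal sketch (assuming Python 3 and network access; the URL below is just an example):

from urllib.request import urlopen

# example URL: the zh.wikipedia.org article for 的 (illustrative choice)
with urlopen("https://zh.wikipedia.org/wiki/%E7%9A%84") as resp:
    print(resp.headers.get_content_charset())   # typically 'utf-8'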

Adam
6

The example you give is an IRI.

IRIs use the UTF-8 encoding. UTF-8 implements Unicode, and in Unicode each character has a code point, which for common Chinese characters (the CJK Unified Ideographs block) lies between 0x4E00 and 0x9FFF, i.e. fits in 2 bytes.

But UTF-8 doesn't encode characters by just storing their code point (UTF-32 does that). Instead, it uses a more complex scheme, which makes the Chinese ideograms in that range 3 bytes long.
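
As an illustrative sketch (Python 3), here is how U+7684 maps onto UTF-8's three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx; the bit-twiddling below only demonstrates the layout and is not how you would normally encode:

cp = ord('的')                          # 0x7684
b1 = 0b11100000 | (cp >> 12)            # leading byte carries the top 4 bits
b2 = 0b10000000 | ((cp >> 6) & 0x3F)    # continuation byte: middle 6 bits
b3 = 0b10000000 | (cp & 0x3F)           # continuation byte: low 6 bits
print(bytes([b1, b2, b3]))              # b'\xe7\x9a\x84'
print('的'.encode('utf-8'))             # the same three bytes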

lovasoa