I first tried typing in a Unicode character, encoding it in UTF-8, and decoding it back. Python happily gave back the original character. I then looked at the encoded value, which is b'\xe6\x88\x91'. I don't understand what this is; it looks like three hex numbers.

Then I did some research and found that the CJK set starts at 4E00, so now I want Python to show me what that character looks like. How do I do that? Do I need to convert 4E00 into something like the form above?

Xufeng

2 Answers

You'll need to decode it using the UTF-8 encoding:

>>> print(b'\xe6\x88\x91'.decode('UTF-8'))
我

By decoding it you're turning the bytes (which is what b'...' is) into a Unicode string and that's how you can display / use the text.
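
As a minimal sketch of that round trip, assuming Python 3 (the variable names are only illustrative), the three "hex numbers" from the question are simply the three byte values of the UTF-8 encoding:

```python
# Round trip between a str and its UTF-8 bytes (Python 3 assumed).
text = "我"
encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(encoded)          # b'\xe6\x88\x91'
print(list(encoded))    # [230, 136, 145], i.e. 0xe6, 0x88, 0x91
print(decoded == text)  # True
```

Each `\x..` escape in the bytes literal is one byte written in hexadecimal, so this character takes three bytes in UTF-8.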

Simeon Visser
  • The text given in the question isn't encoded in utf-8, it's encoded in windows-1252. Using the detect function in chardet will show this. – David Greydanus Nov 26 '14 at 20:21
  • @DavidGreydanus: most likely not, the user already told us the encoding is UTF-8 and displaying the text as windows-1252 doesn't seem to return valid text. – Simeon Visser Nov 26 '14 at 20:24

b'\xe6\x88\x91' is how Python displays the three bytes that make up the UTF-8 encoding of the Unicode code point \u6211, which is the character 我. So there is nothing to convert, other than decoding the bytes to a Unicode string with .decode('utf-8').
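
For going the other way, from a hex code point such as 4E00 to the character itself, a short sketch assuming Python 3 could use the built-ins `chr` and `ord` (the variable names here are illustrative, not from the question):

```python
# From a hex code point string to a character and back (Python 3 assumed).
code_point = int("4E00", 16)   # parse the hex digits into an integer
first_cjk = chr(code_point)    # integer code point -> one-character str

print(first_cjk)                  # 一
print(hex(ord("我")))             # 0x6211, the code point of 我
print(first_cjk.encode("utf-8"))  # b'\xe4\xb8\x80'
```

The code point (4E00) and the UTF-8 bytes (e4 b8 80) are different numbers: the first identifies the character, the second is one way of storing it.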

Daniel
  • So to convert 4E00 to the original character, what should I do? I'm not sure what 4E00 is, I found it here: http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode – Xufeng Nov 26 '14 at 20:24
  • Converting "4E00" to a character is a completely different question. – Ignacio Vazquez-Abrams Nov 26 '14 at 20:25
  • @Xufeng: what is your real problem? `'\u4E00'` is the Unicode representation of the Chinese character for "one": 一. To write this character to disk, you have to encode it, e.g. with UTF-8: `u'\u4E00'.encode('utf-8')` -> `b'\xe4\xb8\x80'`. – Daniel Nov 26 '14 at 20:52