How to convert unicode to its original character in Python

Question

I first tried typing in a Unicode character, encoding it in UTF-8, and decoding it back. Python happily gives back the original character. I took a look at the encoded string; it is b'\xe6\x88\x91'. I don't understand what this is. It looks like 3 hex numbers.

Then I did some research and I found that the CJK set starts from 4E00, so now I want Python to show me what this character looks like. How do I do that? Do I need to convert 4E00 to the form of something like the one above?

http://www.joelonsoftware.com/articles/Unicode.html – Ignacio Vazquez-Abrams Nov 26 '14 at 20:26

score 0 · Answer 1 · answered Nov 26 '14 at 20:09


You'll need to decode it using the UTF-8 encoding:

>>> print(b'\xe6\x88\x91'.decode('UTF-8'))
我

By decoding it you're turning the bytes (which is what b'...' is) into a Unicode string and that's how you can display / use the text.
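
As a rough illustration of the round trip described in the question, here is a minimal sketch assuming Python 3, where str is a Unicode string and b'...' is a bytes object. The variable names are only for illustration:

```python
# Round trip: Unicode string -> UTF-8 bytes -> Unicode string (Python 3 assumed)
text = '我'

encoded = text.encode('utf-8')     # a bytes object holding three bytes
decoded = encoded.decode('utf-8')  # back to a Unicode string

print(encoded)          # b'\xe6\x88\x91'
print(len(encoded))     # 3
print(decoded == text)  # True
```

Each \xNN in the printed bytes is one byte written in hexadecimal, which is why the output looks like three hex numbers.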

answered Nov 26 '14 at 20:09

Simeon Visser


The text given in the question isn't encoded in utf-8, it's encoded in windows-1252. Using the detect function in chardet will show this. – David Greydanus Nov 26 '14 at 20:21
@DavidGreydanus: most likely not, the user already told us the encoding is UTF-8 and displaying the text as windows-1252 doesn't seem to return valid text. – Simeon Visser Nov 26 '14 at 20:24

score 0 · Accepted Answer · answered Nov 26 '14 at 20:10


The text b'\xe6\x88\x91' is the representation of the bytes that are the UTF-8 encoding of the Unicode code point \u6211, which is the character 我. So there is no need to convert anything, other than decoding the bytes to a Unicode string with .decode('utf-8').
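
To make the relationship between the bytes and the code point concrete, here is a small sketch, again assuming Python 3. It only uses the built-ins ord and hex, and the names ch, code_point and byte_values are illustrative:

```python
# Relating the UTF-8 bytes to the Unicode code point (Python 3 assumed)
ch = b'\xe6\x88\x91'.decode('utf-8')

code_point = hex(ord(ch))                          # the code point as a hex string
byte_values = [hex(b) for b in ch.encode('utf-8')]  # the individual UTF-8 bytes

print(code_point)      # 0x6211
print(ch == '\u6211')  # True
print(byte_values)     # ['0xe6', '0x88', '0x91']
```

The code point (6211) and the encoded bytes (e6 88 91) are different numbers: the first identifies the character, the second is how UTF-8 stores it.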

answered Nov 26 '14 at 20:10

Daniel


So to convert 4E00 to the original character, what should I do? I'm not sure what 4E00 is, I found it here: http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode – Xufeng Nov 26 '14 at 20:24
Converting "4E00" to a character is a completely different question. – Ignacio Vazquez-Abrams Nov 26 '14 at 20:25
@Xufeng: what is your real problem? `'\u4E00'` is the unicode representation of chinese «one»: 一. To write this character to disk, you have to encode it, e.g. with UTF-8: `u'\u4E00'.encode('utf-8')` -> `b'\xe4\xb8\x80'`. – Daniel Nov 26 '14 at 20:52
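
For the follow-up about 4E00, one possible approach (a sketch, assuming Python 3 and that the value arrives as a hex string) is to parse the hex digits into an integer and pass it to chr. The name hex_code is hypothetical:

```python
# Turning a hex code point such as "4E00" into its character (Python 3 assumed)
hex_code = '4E00'             # hypothetical input, e.g. read from a table of ranges

ch = chr(int(hex_code, 16))   # parse as base 16, then look up the character
encoded = ch.encode('utf-8')  # the bytes you would write to disk

print(ch)       # 一
print(encoded)  # b'\xe4\xb8\x80'
```

In Python 2 the equivalent built-in would be unichr rather than chr.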
