Chinese encoding in Python

Question

When I output some Chinese characters in Python (pandas), they are displayed like this:

\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5\xe6\x98\xaf\xe6\xb2\xb9\xe6\xb3\xb5\xe6\x95\x85\xe9\x9a\x9c\xe7\x81\xaf\xef\xbc\x8c\xe6\xa3\x80\xe6\x9f\xa5\xe4\xb8\x80\xe4\xb8\x8b\xe6\xb2\xb9\xe6\xb3\xb5\xe6\x8f\x92\xe5\xa4\xb4\xe6\x98\xaf\xe5\x90\xa6\xe6\x8e\xa5\xe8\x99\x9a\xef\xbc\x8c\xe7\x84\xb6\xe5\x90\x8e\xe6\x9f\xa5\xe4\xb8\x80\xe4\xb8\x8b\xe6\xb2\xb9\xe6\xb3\xb5\xe5\x86\x85\xe7\xae\xa1\xe9\x81\x93\xe5\x8e\x8b\xe5\x8a\x9b\xe6\x98\xaf\xe5\x90\xa6\xe7\xac\xa6\xe5\x90\x88\xe6\xad\xa3\xe5\xb8\xb8\xe5\x80\xbc\xe3\x80\x82

What encoding is this? As far as I know, it is not Unicode. Thanks!

Try putting `# -*- coding: utf-8 -*-` at the top of your Python source file to force Python into UTF-8 — Ben, Jul 13 '18 at 22:24
@Ben A coding directive only affects how the interpreter decodes the script itself; it has no effect on what the script does to external data that it reads or writes. — PM 2Ring, Jul 13 '18 at 22:25
That looks like UTF-8 encoded Chinese to me, although I don't read Chinese. 这种情况是油泵故障灯，检查一下油泵插头是否接虚，然后查一下油泵内管道压力是否符合正常值。 — PM 2Ring, Jul 13 '18 at 22:28
@PM2Ring I'm assuming he's doing something like `print('你好')` and getting hex output. I don't have a lot of encoding problems, so I could very well be wrong — Ben, Jul 13 '18 at 22:28
It is hexadecimal. There are tools online to convert hexadecimal to text :D — Daming Lu, Jul 13 '18 at 23:27
Surely those online tools want to know what the encoding is as well? — Jongware, Jul 14 '18 at 00:01

score 1 · Accepted Answer · answered Jul 14 '18 at 15:05

The output you are seeing is a bytes object. To turn it into text, call `output.decode('utf-8')`.

For example:

output = b'\xe8\xbf\x99\xe7...'
unicode_output = output.decode('utf-8')
print(unicode_output)

would then print the decoded Chinese text (I cannot include it here because it counts as spam).

The same thing can be written in one line: `print(b'\xe8\xbf\x99\xe7...'.decode('utf-8'))`.
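As a complete, runnable sketch of the decode step, the block below uses only the first twelve bytes of the output shown in the question, so nothing elided has to be guessed. The variable names `raw` and `text` are illustrative.

```python
# First twelve bytes of the output shown in the question.
raw = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5'

# bytes -> str: interpret the byte sequence as UTF-8.
text = raw.decode('utf-8')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)                 # 这种情况
```

These four characters match the start of the decoded text quoted in the comments.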

However, if that doesn't work, the output is probably not a bytes object but a string that contains the byte values. In that case, there is another solution:

output = '\xe8\xbf\x99\xe7...'
# Map each character back to the byte with the same value, then decode as UTF-8.
print(output.encode('latin-1').decode('utf-8'))

That should be able to fix it. Hope you got something useful out of this. Have a good day!
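The answer does not say exactly what kind of string is meant, so here is a sketch covering two possibilities, neither of which needs `exec`. The first assumes the bytes were at some point decoded as Latin-1, leaving one character per byte. The second assumes the string holds the escape sequences as literal text (a backslash, an `x`, and two hex digits). Both are assumptions about the asker's data, and the variable names are illustrative.

```python
# Case 1 (assumption): each character's code point is really a byte value.
mojibake = '\xe8\xbf\x99\xe7\xa7\x8d'
fixed_1 = mojibake.encode('latin-1').decode('utf-8')

# Case 2 (assumption): the string contains literal backslash escapes as text.
escaped = r'\xe8\xbf\x99\xe7\xa7\x8d'
as_chars = escaped.encode('ascii').decode('unicode_escape')  # resolve the \x.. escapes
fixed_2 = as_chars.encode('latin-1').decode('utf-8')

print(fixed_1 == fixed_2)  # True
print(fixed_1)             # 这种
```

Latin-1 is used for the intermediate step because it maps code points 0 to 255 one-to-one onto byte values.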

score 0 · Answer 2 · answered Jul 13 '18 at 23:51

This is a bytes value containing valid UTF-8 encoded Chinese text (as far as I can trust Google Translate).

If it's a string literal from your code, add `# -*- coding: utf-8 -*-` as the first or second line of your Python file (Python 3 already assumes UTF-8 source files by default).

If it's external data, convert it to text (str type) with `bytes_text.decode("utf-8")`.
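For the external-data case, a minimal sketch assuming the data comes from a file: it writes a few UTF-8 bytes to a temporary file, then reads them back both ways. The file name `sample.txt` and the variable names are hypothetical, and the payload is two characters taken from the question's text.

```python
import os
import tempfile

payload = b'\xe6\xb2\xb9\xe6\xb3\xb5'  # UTF-8 bytes for two characters from the question

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')  # hypothetical file name
    with open(path, 'wb') as f:
        f.write(payload)

    # Binary mode returns bytes, which must be decoded explicitly.
    with open(path, 'rb') as f:
        from_binary = f.read().decode('utf-8')

    # Text mode with an explicit encoding returns str directly.
    with open(path, encoding='utf-8') as f:
        from_text = f.read()

print(from_binary == from_text)  # True
```

Passing `encoding='utf-8'` explicitly avoids depending on the platform's default encoding.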

score 0 · Answer 3 · answered Jul 14 '18 at 00:07

raw_bytes = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85 . . .'

With raw_bytes being a <class 'bytes'> object containing your hex-escaped bytes, you can call decode on it and get a <class 'str'> representation of your characters.

string_text = raw_bytes.decode("utf-8")
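To make the bytes versus str distinction concrete, here is a small sketch using only the nine bytes spelled out above (the elided remainder is left out). It compares the lengths of the two objects and checks that encoding the text again gives back the original bytes.

```python
raw_bytes = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85'
string_text = raw_bytes.decode('utf-8')

print(len(raw_bytes))    # 9 (bytes)
print(len(string_text))  # 3 (characters, three bytes each in UTF-8)

# Encoding the text as UTF-8 reproduces the original bytes.
round_trip = string_text.encode('utf-8')
print(round_trip == raw_bytes)  # True
```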
