Setting Python encoding for printing Chinese characters

Question

My code is below. I don't know why it can't print Chinese. Please help.

When I try to print more than one variable at a time, the words look like ASCII escapes or some raw representation.

How can I fix it?

# -*- coding: utf-8 -*-
import pygoldilocks
import sys
reload(sys)  
sys.setdefaultencoding('utf8')

rows = ( '已','经激活的区域语言' )
print( rows[0] )
print( rows[1] )
print( rows[0], rows[1] )
print( rows[0].encode('utf8'), rows[1].decode('utf8') )
print( rows[0], 1 )


$ python test.py
已
经激活的区域语言
('\xe5\xb7\xb2', '\xe7\xbb\x8f\xe6\xbf\x80\xe6\xb4\xbb\xe7\x9a\x84\xe5\x8c\xba\xe5\x9f\x9f\xe8\xaf\xad\xe8\xa8\x80')
('\xe5\xb7\xb2', u'\u7ecf\u6fc0\u6d3b\u7684\u533a\u57df\u8bed\u8a00')
('\xe5\xb7\xb2', 1)

Because you're trying to run Python 3 code under Python 2 (which has [been in the process of being sunsetted since 2008](https://www.python.org/doc/sunset-python-2/)), and in any case `sys.setdefaultencoding('utf8')` is a notorious hack that people were warned not to use a decade ago: [Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?](https://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script) — smci, Jan 04 '22 at 08:23

score 3 · Accepted Answer · answered May 09 '18 at 07:59

All your outputs are normal. By the way, this:

reload(sys)  
sys.setdefaultencoding('utf8')

is really a poor man's trick to set the Python default encoding. It is seldom useful (IMHO it achieves nothing in the code shown) and should only be used when no cleaner way is possible. I have been using Python 2 for decades with a non-ASCII charset (Latin1) and only used that trick in my very first scripts.

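And the # -*- coding: utf-8 -*- line does not change your byte strings: it only affects how unicode literal strings (u'...') are decoded, and you have none. Keep it anyway, because Python 2 refuses to compile a source file that contains non-ASCII bytes without an encoding declaration (PEP 263), and it may also be useful for your text editor.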

Now what really happens:

You define rows as a 2-tuple of (byte) strings containing Chinese characters encoded in UTF-8. Fine.

When you print a string, its bytes are passed directly to the output system (here a terminal or screen). Since the terminal correctly processes UTF-8, it converts the UTF-8 byte representation into the proper characters. So print (rows[0]) (which is executed as print rows[0] in Python 2: (rows[0]) is not a tuple, whereas (rows[0],) is a 1-tuple) correctly displays the Chinese characters.
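The point about parentheses can be checked directly. A small sketch that should run unchanged on Python 2 and Python 3; the names grouped and one_tuple are only illustrative:

```python
# -*- coding: utf-8 -*-
rows = ('已', '经激活的区域语言')

grouped = (rows[0])     # plain parentheses: still just the string
one_tuple = (rows[0],)  # the trailing comma is what makes a tuple

print(type(grouped) is tuple)    # False
print(type(one_tuple) is tuple)  # True
print(len(one_tuple))            # 1
```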

But when you print a tuple, Python actually prints the representation (repr) of each element of the tuple (it would be the same for a list, set or dict). And in Python 2, the representation of a byte or unicode string escapes all non-ASCII characters in \x.. or \u.... form.

In a Python interactive session, you should see:

>>> print rows[0]
已
>>> print repr(rows[0])
'\xe5\xb7\xb2'
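For readers on Python 3, the same mechanism can be reproduced with bytes objects, which play the role of Python 2 byte strings. This is a sketch rather than the original poster's code, and container_text is a hypothetical helper that rebuilds what str() does for a tuple:

```python
# Python 3 sketch: bytes stand in for Python 2 byte strings.
rows = ('已'.encode('utf-8'), '经激活的区域语言'.encode('utf-8'))

def container_text(items):
    """Mimic str() of a tuple with 2+ items: join the repr() of each element."""
    return '(' + ', '.join(repr(item) for item in items) + ')'

print(repr(rows[0]))                      # b'\xe5\xb7\xb2'
print(str(rows) == container_text(rows))  # True
```

The second line prints True because the text of a tuple is built from the repr() of its elements, never from their str().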

TL;DR: when you print containers, you actually print the representation of the elements. If you want to display the string values, use an explicit loop or a join:

print '(' + ', '.join(rows) + ')'

displays as expected:

(已, 经激活的区域语言)
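Note that join only accepts strings, so a row such as (rows[0], 1) needs each element converted first. A minimal Python 3 sketch, assuming any byte strings are UTF-8 encoded; format_row is a hypothetical helper, not part of any library:

```python
def format_row(items, encoding='utf-8'):
    """Render a row the way a tuple prints, but with str() values, not repr()."""
    parts = []
    for item in items:
        if isinstance(item, bytes):
            # Assumption: byte strings hold text in the given encoding.
            parts.append(item.decode(encoding))
        else:
            parts.append(str(item))
    return '(' + ', '.join(parts) + ')'

mixed = format_row(('已'.encode('utf-8'), 1))
print(mixed)  # (已, 1)
```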

I appreciate your kindness. I understand now. — Kwang Hun Lee, May 09 '18 at 09:01

score 0 · Answer 2 · answered May 09 '18 at 07:22


Your problem is that you are using Python 2, I guess. Your code

print( rows[0], rows[1] )

is evaluated as

tmp = ( rows[0], rows[1] ) # a tuple!
print tmp # Python 2 print statement!

Since a tuple is formatted by calling repr() on each of its elements, you see the ASCII-escaped representation.

Solution: Upgrade to Python 3.
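To see what the upgrade changes, here is a hedged Python 3 sketch that captures what print writes in both spellings. The captured helper is hypothetical and exists only to make the output inspectable:

```python
import io
from contextlib import redirect_stdout

rows = ('已', '经激活的区域语言')

def captured(*args):
    """Return the text that print(*args) would write to stdout."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        print(*args)
    return buf.getvalue()

two_args = captured(rows[0], rows[1])     # print(a, b): two separate values
one_tuple = captured((rows[0], rows[1]))  # print((a, b)): one tuple object

# two_args is '已 经激活的区域语言\n'.
# one_tuple is "('已', '经激活的区域语言')\n": still repr(), but Python 3's
# repr() keeps printable non-ASCII characters instead of escaping them.
escaped = ascii(rows[0])  # "'\\u5df2'", close to what Python 2 displayed
```

In Python 3, print is a function, so print(a, b) passes two arguments and each is written with str(). A tuple appears only when the extra pair of parentheses is written explicitly.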

answered May 09 '18 at 07:22

Ulrich Eckhardt


Thanks for the explanation. – Kwang Hun Lee May 09 '18 at 09:04

score 0 · Answer 3 · answered May 09 '18 at 07:58

There are two less drastic solutions than upgrading to Python 3.

The first is not to use Python 3 print() syntax:

rows = ( '已','经激活的区域语言' )
print rows[0] 
print rows[1] 
print rows[0], rows[1] 
print rows[0].decode('utf8'), rows[1].decode('utf8') 
print rows[0], 1

已
经激活的区域语言
已 经激活的区域语言
已 经激活的区域语言
已 1

The second is to import Python 3 print() syntax into Python 2:

from __future__ import print_function

rows = ( '已','经激活的区域语言' )
print (rows[0]) 
print (rows[1])
print (rows[0], rows[1]) 
print (rows[0].decode('utf8'), rows[1].decode('utf8'))
print (rows[0], 1)

Output is the same.

And drop that sys.setdefaultencoding() call. It's not intended to be used like that (only in the site module) and does more harm than good.
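A commonly recommended alternative, not spelled out in this thread, is to keep text as unicode inside the program and encode explicitly where it leaves the program. A minimal sketch, assuming UTF-8 output and using an in-memory io.BytesIO as a stand-in for a real byte stream:

```python
# -*- coding: utf-8 -*-
import io

raw = io.BytesIO()  # stand-in for a file or pipe opened in binary mode
out = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')

out.write(u'已 经激活的区域语言\n')  # text goes in, bytes come out
out.flush()

encoded = raw.getvalue()
print(encoded.startswith(b'\xe5\xb7\xb2'))  # True
```

The encoding is stated once, at the boundary, so nothing depends on the interpreter-wide default that sys.setdefaultencoding() alters.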

Thanks. I understand now. — Kwang Hun Lee, May 09 '18 at 09:09
