
I have a Python script that prints Chinese output on the command line. It works fine in Eclipse. However, when I run it in a DOS window, it prints ? (question marks) and garbage characters. Could it be because of Big5 vs. GB encoding? If so, how do I control it?

By the way, I already installed the Asian character sets, which is why it works in Eclipse.

Edit: By combining chcp, encode('utf-8'), and setting the non-Unicode handler, I can now see the characters, but a simple print results in an exception:

chcp 65001
Active code page: 65001

Z:\src>c:\Python27\python.exe mobTest.py
Traceback (most recent call last):
  File "mobTest.py", line 94, in <module>
    print u'哈哈'.encode('utf-8')
IOError: [Errno 13] Permission denied
Ching Liu
  • First, watch this video or read the accompanying slides: [How Do I Stop The Pain?](http://nedbatchelder.com/text/unipain.html) – Robᵩ Mar 19 '13 at 01:39
  • [Relevant.](http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using) – Cairnarvon Mar 19 '13 at 01:41
  • Hi Rob, I'm pretty familiar with the fundamentals of Unicode display in Python; however, I think this is a problem specific to Unicode on the Windows command line. The code works fine in Eclipse, meaning that I didn't make the fundamental errors the presentation addresses. I need more specific assistance. – Ching Liu Mar 19 '13 at 04:04
  • Code page 65001 is not supported in Python 2. Support was added in Python 3, but it is broken. Encode in `cp936` if you've set the locale to Chinese. See my updated answer. – Mark Tolonen Mar 23 '13 at 14:12

2 Answers


What is your system locale? English (United States), for example, uses code page 437 for the console, which doesn't support Chinese characters. Chinese (Simplified, PRC) makes it possible to print Chinese to the console.

You can change the setting in Control Panel under Region and Language (Windows 7), on the Administrative tab; a reboot is required. After that, printing a Unicode Chinese string to the console will work. You can even type in Chinese, as an IME will be available.

Changing the system locale will only affect the console and non-Unicode programs. Most modern programs won't notice.
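
The difference between the two code pages can be checked from Python itself. This is a minimal sketch rather than part of the original answer, written for Python 3 (the `u` prefix is kept so the literals also read correctly under Python 2.7); `can_encode` is a hypothetical helper name, not a library function:

```python
# -*- coding: utf-8 -*-
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

sample = u'哈哈'

# cp437 is the English (United States) console code page,
# cp936 the Chinese (Simplified, PRC) one.
for codec in ('cp437', 'cp936'):
    print(codec, can_encode(sample, codec))

# Characters missing from the code page turn into question marks
# when the encoder substitutes instead of raising an error.
print(sample.encode('cp437', 'replace'))
```

Running the same check with `'big5'` or another codec name from Python's standard encodings shows whether a given console code page can represent the text at all.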

Edit: Example using Chinese PRC region and running in the Windows console:

Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'哈哈'
哈哈
>>> import sys
>>> sys.stdout.encoding
'cp936'

Example script using UTF-8 source encoding. Make sure to save the source in UTF-8, as declared by the #coding comment:

# coding: utf-8
print u'哈哈'
print '哈哈' # this will be UTF-8 encoded, and NOT work

Execution:

C:\>python x.py
哈哈
鍝堝搱
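
The garbled second line can be reproduced without a Chinese-locale console. A small sketch, assuming Python 3 (where decoding has to be spelled out explicitly): it takes the UTF-8 bytes that the plain byte string literal holds and interprets them the way a cp936 console would.

```python
# -*- coding: utf-8 -*-
text = u'哈哈'

# The bytes a non-Unicode '哈哈' literal holds in a UTF-8 source file.
utf8_bytes = text.encode('utf-8')

# A cp936 console interprets those same bytes as GBK characters.
garbled = utf8_bytes.decode('cp936')

print(len(utf8_bytes))  # three bytes per character in UTF-8
print(garbled)
```

The six UTF-8 bytes pair up into three two-byte GBK characters, which is why two characters come out as three unrelated ones.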
Mark Tolonen
  • I tried Chinese (simplified, PRC), even then it came out bad. – Ching Liu Mar 19 '13 at 14:32
  • @ChingLiu, please post your code. Did you print a Unicode string or a UTF-8-encoded byte string? If a byte string, it should be encoded in the console encoding (`cp936` if I remember correctly). – Mark Tolonen Mar 19 '13 at 15:41
  • UTF-8 (cp65001) is broken. It doesn't work with Python 2 (or 3 very well). Just `print u'哈哈'`. It will be encoded in the console encoding by default (sys.stdout.encoding) which should be `cp936` if you changed the system locale. – Mark Tolonen Mar 19 '13 at 23:48

This is how I solved the problem for simplified Chinese:

  1. Set the language for non-Unicode programs under Region and Language settings to Simplified Chinese.
  2. Add the following line to the top of the Python file (I recommend saving a backup first):

    # -*- coding: gbk -*-

This replaces any coding declaration you had before (in my case utf-8). Any UTF-8 strings already in the file will then be read as GBK and come out wrong, so you have to re-enter those lines.

Now running in a DOS window and in Eclipse will both yield the correct characters. I'm guessing that something similar can be done for Traditional Chinese by choosing Traditional Chinese in the Windows settings and big5 as the coding. Testing that is left as an exercise for the reader.

Ching Liu
  • The encoding of the source file doesn't matter. You can use UTF-8 in the source file. The important point is to use Unicode strings for text. `print u'哈哈'`. – Mark Tolonen Mar 23 '13 at 13:40
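
The point made in this last comment can be illustrated with a short sketch, assuming Python 3 and simulating the two source encodings by encoding the same text by hand: the bytes on disk differ, but once decoded with the matching codec both yield the same Unicode string, which `print` then encodes for the console.

```python
# -*- coding: utf-8 -*-
# The same text as it would be stored in a UTF-8 file and in a GBK file.
utf8_source = u'哈哈'.encode('utf-8')
gbk_source = u'哈哈'.encode('gbk')

# Decoding each with its own declared encoding gives identical text.
from_utf8 = utf8_source.decode('utf-8')
from_gbk = gbk_source.decode('gbk')

print(utf8_source == gbk_source)  # the raw bytes differ
print(from_utf8 == from_gbk)      # the Unicode strings do not
```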