Python subprocess echo a unicode literal

Question

I'm aware that similar questions have been asked before, but I haven't found a solution in them.

I want to pass a unicode literal, defined in my Python file, to the subprocess module, but I'm not getting the results I need. For example, the following code

# -*- coding: utf-8 -*-
# Python 2.7
import subprocess

cmd = ['echo', u'你好']
new_cmd = []
for c in cmd:
    # Encode unicode arguments to UTF-8 bytes before passing them on
    if isinstance(c, unicode):
        c = c.encode('utf-8')
    new_cmd.append(c)
subprocess.call(new_cmd)

prints out

ä½ å¥½

If I change the code to

# -*- coding: utf-8 -*-
# Python 2.7
import sys
import subprocess

cmd = ['echo', u'你好']
new_cmd = []
for c in cmd:
    # Encode unicode arguments with the filesystem encoding instead of UTF-8
    if isinstance(c, unicode):
        c = c.encode(sys.getfilesystemencoding())
    new_cmd.append(c)
subprocess.call(new_cmd)

I get the following

??

At this stage I can only assume that I keep making a simple mistake, but I'm having a hard time figuring out what it is. How can I get echo to print the following when invoked via Python's subprocess module?

你好

Edit:

The version of Python is 2.7. I'm running on Windows 8 but I'd like the solution to be platform independent.

Check your system locale. Try setlocale: https://docs.python.org/2/library/locale.html — oxana, May 05 '15 at 14:14
Oh, I thought you would actually have that problem too. @no_test proposed direction is probably a better idea then. — cnluzon, May 05 '15 at 14:16
@no_test - Do you have an example? I've read the page but I don't understand it. I'd guess this is about setting my computer's language page. But why is that necessary if I can copy & paste the echo onto the command line? Shouldn't it already be able to handle these characters? — Shane Gannon, May 05 '15 at 14:22
`import locale # Store your system locale loc=locale.getlocale() # change locale locale.setlocale(locale.LC_ALL, ('zh_CN','UTF8')) # return to system locale locale.setlocale(locale.LC_ALL, loc)` — oxana, May 05 '15 at 14:26
@ShaneGannon, out of interest if you pass a string instead and use shell=True what do you see? — Padraic Cunningham, May 05 '15 at 14:26
@Padraic Cunningham - If I do "subprocess.call('echo 你好', shell=True)" I get "ä½ å¥½" — Shane Gannon, May 05 '15 at 14:30
It sounds like `subprocess` itself is encoding the string to the wrong character set. If you do `chcp` at the command line what does it return? Edit: no it's not `subprocess`, I missed the part where you `encode` the parameters. — Mark Ransom, May 05 '15 at 14:33
That's weird, that code page isn't capable of displaying the characters you want. How is the `echo` working when you type it? — Mark Ransom, May 05 '15 at 14:35
Honestly don't know. Windows confuses me in this regard. I get that Linux/Unix use utf-8 by default & as a result should be able to support all characters. I'm not sure how Windows gets away with it. It not only supports echo but mkdir as well. — Shane Gannon, May 05 '15 at 14:38
You say in another comment that you use `Conemu`. That's probably why it works when you type it by hand, because `echo` is a shell built-in command. When you use `subprocess` with `shell=True`, it uses the default `cmd.exe`. — Mark Ransom, May 05 '15 at 14:46
Ah.... no. You're right. First time I came across a difference between ConEmu and cmd. It does not work when run from a normal cmd. i.e. "你好" gets converted into "??". That was super confusing. Good insight. — Shane Gannon, May 05 '15 at 14:50

Serge Ballesta · Answer 1 · 2015-05-05T16:50:00.100

2

Your first try was the best.

You actually converted the two unicode characters u'你好' (or u'\u4f60\u597d') to UTF-8, which gives b'\xe4\xbd\xa0\xe5\xa5\xbd'.

You can check this in IDLE, which fully supports unicode: there, b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf-8') gives back 你好. Another way to check is to redirect the script's output to a file and open it with a UTF-8 compatible editor: there again you will see what you want.
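The round trip can also be checked without any console by comparing values instead of printing them. This is a minimal sketch; the variable names are illustrative and not taken from the question.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3: nothing is printed, so the console
# encoding plays no part in the result.
text = u'\u4f60\u597d'            # the two characters from the question
encoded = text.encode('utf-8')    # three bytes per character

assert encoded == b'\xe4\xbd\xa0\xe5\xa5\xbd'
assert len(encoded) == 6
assert encoded.decode('utf-8') == text   # decoding restores the original
```

If these assertions pass, the bytes are right and any garbling happens later, when something displays them.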

But the problem is that the Windows console does not fully support unicode. What it can display depends on:

the code page installed - I do not know about Windows 8, but previous versions had poor unicode support and could display only 256 characters
the font used in the console - not all fonts have glyphs for all characters.

If you know a code page that contains your characters (I don't), you can try to select it in the console with chcp and explicitly encode your unicode string to that code page. But on my French machine, I do not know how to do that ... except by going through a text file!
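One way to look for such a code page is to ask Python which codecs can encode the string at all. The sketch below is only a starting point: the helper name and the candidate list are assumptions made for illustration, and a codec that can encode the text does not guarantee that the matching code page is installed on a given Windows machine or that the console font has the glyphs.

```python
def codecs_that_can_encode(text, candidates):
    """Return the codec names from `candidates` that can encode `text`."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this code page has no mapping for some character
        usable.append(name)
    return usable

# Hypothetical shortlist: two Western OEM pages, the Western ANSI page,
# two Chinese pages, and UTF-8.
candidates = ['cp437', 'cp850', 'cp1252', 'cp936', 'cp950', 'utf-8']
usable = codecs_that_can_encode(u'\u4f60\u597d', candidates)
```

Any name left in `usable` is a candidate to try with chcp, using the numeric part of the codec name.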

As you mentioned ConEmu, I gave it a try ... and it works fine with it, using Python 3.4!

chcp 65001
py -3
import subprocess
cmd = ['cmd', '/c', 'echo', u'\u4f60\u597d']
subprocess.call(cmd)

gives:

你好  
0

The problem is only in cmd.exe windows!

edited May 05 '15 at 16:50

answered May 05 '15 at 15:11


Unsure that this is really an answer because I do not say *do that and it will work*, but it gives some hints – Serge Ballesta May 05 '15 at 15:12
hmm... as an experiment I ran "print b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf-8')" from the python interpreter but all I got back was "õ¢áÕÑ¢". The annoying thing is that, surprise for me, the cmd does not support 你好. But I can still create a folder with this name from the GUI and Conemu. So an API/approach exists somewhere. – Shane Gannon May 05 '15 at 15:17
@ShaneGannon : I didn't use ConEmu (seems nice ...) but as it is internally a GUI application, it can have full unicode support like IDLE has. The problem is only for console applications running in a `cmd.exe` window. – Serge Ballesta May 05 '15 at 16:20
True. So far it looks like cmd does not support unicode properly. Even after switching the font to "Lucida Console" I can only get some Unicode characters to render. E.g. š but not 你好. – Shane Gannon May 05 '15 at 16:25
@ShaneGannon : it works fine with ConEmu. See my edit. And I can confirm that ConEmu uses Consolas as do my cmd.exe windows ... – Serge Ballesta May 05 '15 at 16:47
This gives unusual behaviour with Python 2.7. If I run the above edit from the interpreter then every line I run prints out "LookupError: unknown encoding: cp65001". If I run it from the command line via a script I get 'UnicodeEncodeError'. But one of the main reasons for Python 3 is to fix Unicode, so it's not too surprising that it may behave better. – Shane Gannon May 05 '15 at 16:58
@ShaneGannon the problem is that the command window is forced to interpret the byte sequence emitted through `stdout`, and it does that using the code page. `ConEmu` is able to bypass `stdout` for internal commands such as `echo`, and write to the window with Unicode directly. Code page 65001 is supposed to use UTF-8, but there were problems with it and so the Python developers took a long time to get it to work to their satisfaction - that work couldn't be backported to 2.7. – Mark Ransom May 05 '15 at 22:30

score 1 · Accepted Answer · edited May 23 '17 at 11:44

Conclusion: Pay attention to character encodings (three different ones are involved here: the source file encoding, the Windows ANSI code page, and the console's OEM code page). Use Python 3 if you want portable Unicode support (pass arguments as Unicode, don't encode them), or make sure that the data can be represented in the current character encodings of the environment (encode using sys.getfilesystemencoding() on Python 2, as you do in the second code example).

The first code example is incorrect. The effect is the same as (run it in IDLE -- py -3 -midlelib):

>>> print(u'你好'.encode('utf-8').decode('mbcs')) #XXX DON'T DO IT!
ä½ å¥½

where the mbcs codec uses your Windows ANSI code page (typically cp1252; it may differ, e.g. cp1251 on Russian Windows).

On Windows, Python 2 starts a subprocess through the CreateProcess macro, which is equivalent to the CreateProcessA function there. CreateProcessA interprets the input bytes as being encoded in your Windows ANSI encoding. This is unrelated to the Python source code encoding (utf-8 in your case).

It is expected that you get mojibake if you use a wrong encoding.
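The mojibake can be reproduced on any platform by naming the code page explicitly instead of relying on mbcs, which exists only on Windows. This sketch assumes the ANSI code page is cp1252, the typical case mentioned above; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3.
utf8_bytes = u'\u4f60\u597d'.encode('utf-8')

# Reading the UTF-8 bytes as if they were cp1252 turns each byte into one
# character: the six characters shown as mojibake in the question.
misread = utf8_bytes.decode('cp1252')
assert misread == u'\xe4\xbd\xa0\xe5\xa5\xbd'
assert len(misread) == 6

# No information is lost: reversing the wrong step recovers the text.
assert misread.encode('cp1252').decode('utf-8') == u'\u4f60\u597d'
```

The third character of `misread` is a no-break space (U+00A0), which is why the garbled output appears to contain a gap.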

Your second code example should work if two conditions hold. First, the input characters must be representable in the Windows ANSI code page, such as cp1252, so that Unicode can be encoded to bytes. Second, either echo prints to the Windows console through a Unicode API such as WriteConsoleW() (see the Python 3 package win-unicode-console, which enables print(u'你好') whatever your chcp ("OEM") code page is, as long as the console font supports the characters), or the characters must be representable in the OEM code page used by cmd.exe, such as cp437 (run chcp to find out yours). The ?? question marks indicate that 你好 can't be represented in your console encoding.
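The question marks can be imitated on any platform. The sketch below assumes cp1252 as the target code page and uses Python's explicit 'replace' error handler to stand in for the substitution that happens on Windows; where exactly that substitution takes place in the question's setup is an assumption here.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3.
text = u'\u4f60\u597d'

# Strict encoding refuses, because cp1252 has no mapping for these characters.
try:
    text.encode('cp1252')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With substitution, each unrepresentable character becomes one '?'.
replaced = text.encode('cp1252', 'replace')

assert strict_ok is False
assert replaced == b'??'
```

Unlike the mojibake case, this loss is irreversible: the original characters cannot be recovered from the question marks.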

To support arbitrary Unicode arguments (including characters that can't be represented in either the Windows ("ANSI") or MS-DOS (OEM) code page), you need the CreateProcessW function, which is what Python 3 uses. See Unicode filenames on Windows with Python & subprocess.Popen() (http://stackoverflow.com/q/1910275/4279).
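On Python 3 the argument can be passed as an ordinary str with no manual encoding. The following sketch is Python 3 only and sidesteps the console entirely: it starts a child Python process rather than echo, and has the child report the argument in ASCII escapes so that the result does not depend on what the terminal can display. The child script and variable names are illustrative.

```python
import subprocess
import sys

# The child prints an ASCII-only representation of its first argument.
child_code = 'import sys; print(ascii(sys.argv[1]))'

# Pass the argument as str; Python 3 handles the conversion for the OS.
output = subprocess.check_output(
    [sys.executable, '-c', child_code, '\u4f60\u597d']
)
received = output.decode('ascii').strip()

assert received == "'\\u4f60\\u597d'"
```

If the child reports the same two code points, the argument survived the process boundary intact, and any remaining problem lies in how the console renders output.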

Unfortunately this is the correct answer. With Python 2.7 on Windows 8 even with Lucida Console font enabled it is not possible to represent all characters. I had the luxury of moving to another platform in order to get this to work. — Shane Gannon, May 06 '15 at 16:46
@ShaneGannon: It is possible to show all characters (at least those that are supported by Lucida Console font) e.g., you could call `CreateProcessW` yourself using `ctypes` module. See [the last link in my answer](http://stackoverflow.com/q/1910275/4279). Or you could use [`WriteConsoleW()`](http://stackoverflow.com/a/19206014/4279) to write Unicode from Python directly to Windows console. And if you don't need to support Windows console then just use `'utf-8'` encoding and redirect the output to a file (or another program if it allows to specify its input encoding). — jfs, May 06 '15 at 16:56
I don't think Lucida Console font supports 你好. Since I can choose not to use Windows utf-8 works well for me. — Shane Gannon, May 06 '15 at 17:30
