
Yet another encoding question on Python.

How can I pass non-ASCII characters as parameters on a subprocess.Popen call?

My problem is not with stdin/stdout, as in the majority of other questions on Stack Overflow, but with passing those characters in the args parameter of Popen.

Python script used for testing:

import subprocess

cmd = 'C:\Python27\python.exe C:\path_to\script.py -n "Testç on ã and ê"'

process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output, err = process.communicate()
result = process.wait()

print result, '-', output

For this example call, script.py receives the argument garbled, as TestÃ§ on Ã£ and Ãª instead of Testç on ã and ê. If I copy-paste this same command string into a CMD shell, it works fine.

What I've tried, besides what's described above:

  1. Checked if all Python scripts are encoded in UTF-8. They are.
  2. Changed to unicode (cmd = u'...'), received a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 128: ordinal not in range(128) on line 5 (the Popen call).
  3. Changed to cmd = u'...'.decode('utf-8'), received the same UnicodeEncodeError, this time on line 3 (the decode call).
  4. Changed to cmd = u'...'.encode('utf8'), which results in the same garbled TestÃ§ on Ã£ and Ãª.
  5. Added the PYTHONIOENCODING=utf-8 environment variable, with no luck.

Looking at tries 2 and 3, it seems like Popen issues a decode call internally, but I don't have enough experience in Python to advance based on this suspicion.
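For what it's worth, both symptoms can be reproduced in isolation with a small UTF-8-encoded script (just a sketch, assuming the server's ANSI code page is the Western cp1252): Python 2's default ascii codec raises the same error as tries 2 and 3, and decoding the argument's UTF-8 bytes with cp1252 yields the same garbled text as try 4.

# -*- coding: utf-8 -*-
arg = u'Testç on ã and ê'

# Tries 2 and 3: Python 2 falls back to the ascii codec whenever it has to
# turn a unicode string into bytes implicitly, which gives the same error.
try:
    arg.encode('ascii')
except UnicodeEncodeError as e:
    print e
    # 'ascii' codec can't encode character u'\xe7' in position 4: ordinal not in range(128)

# Try 4: the UTF-8 bytes of the argument, decoded with the ANSI code page
# (cp1252 here), give exactly the garbled text that script.py receives.
print repr(arg.encode('utf-8').decode('cp1252'))
# u'Test\xc3\xa7 on \xc3\xa3 and \xc3\xaa', i.e. TestÃ§ on Ã£ and Ãª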

Environment: Python 2.7.11 running on Windows Server 2012 R2.

I've searched for similar problems but haven't found a solution. A similar question is asked in "what is the encoding of the subprocess module output in Python 2.7?", but no viable solution is offered there.

I read that Python 3 changed the way string and encoding works, but upgrading to Python 3 is not an option currently.

Thanks in advance.

Dinei
    If Python 3 isn't an option, then you'll have to use ctypes. In Python 2 `Popen` calls WinAPI `CreateProcessA`. The "A" suffix means this function decodes the command-line as an ANSI string (e.g. codepage 1252 in Western Europe) into a native UTF-16LE string. Almost all string handling in Windows and the kernel is UTF-16LE. The non-Unicode codepage encodings are a legacy from DOS and Windows 9x. Their primary use nowadays is to transform UTF-8 into meaningless mojibake... – Eryk Sun Jan 23 '18 at 03:35
  • In Python 3, `Popen` calls `CreateProcessW`, passing the command line as a native, UTF-16LE string. The CMD shell is also a Unicode application (since 1993) that calls wide-character `CreateProcessW`. – Eryk Sun Jan 23 '18 at 03:36
  • When CMD has to encode and decode strings (e.g. reading a batch script), it uses the legacy console codepage. So it may be possible to run `cmd /k chcp.com 65001` with stdin set to a pipe, and then pipe it the command line as a UTF-8 string. – Eryk Sun Jan 23 '18 at 03:43
  • Sorry, that doesn't work, so it's back to ctypes. After changing the console codepage, CMD does try to decode the piped string as UTF-8, but it does so one byte at a time while reading from the pipe, rather than decoding a line at a time. Obviously this fails for non-ASCII characters that use 2-4 bytes per character. – Eryk Sun Jan 23 '18 at 04:07
  • @eryksun Thanks, that's some useful information. Please post it as an answer and I'll accept it... Comments on stackoverflow are somewhat transient. – Dinei Jan 25 '18 at 14:51

1 Answer


As noted in the comments, subprocess.Popen in Python 2 calls the Windows function CreateProcessA, which accepts a byte string encoded in the currently configured ANSI code page. Luckily Python has an encoding named mbcs that stands in for the current code page.

cmd = u'C:\Python27\python.exe C:\path_to\script.py -n "Testç on ã and ê"'.encode('mbcs')

Unfortunately you can still fail if the string contains characters that can't be encoded into the current code page.
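If you do hit characters that the code page can't represent, the ctypes route described in the comments avoids the ANSI conversion entirely by calling CreateProcessW with a wide (UTF-16LE) command line. Below is a minimal sketch of that approach; the run_unicode_cmdline helper name is just for illustration, it only waits for the child and returns its exit code, and it does not capture stdout the way the Popen example does.

# -*- coding: utf-8 -*-
import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

class STARTUPINFOW(ctypes.Structure):
    _fields_ = [('cb', wintypes.DWORD),
                ('lpReserved', wintypes.LPWSTR),
                ('lpDesktop', wintypes.LPWSTR),
                ('lpTitle', wintypes.LPWSTR),
                ('dwX', wintypes.DWORD),
                ('dwY', wintypes.DWORD),
                ('dwXSize', wintypes.DWORD),
                ('dwYSize', wintypes.DWORD),
                ('dwXCountChars', wintypes.DWORD),
                ('dwYCountChars', wintypes.DWORD),
                ('dwFillAttribute', wintypes.DWORD),
                ('dwFlags', wintypes.DWORD),
                ('wShowWindow', wintypes.WORD),
                ('cbReserved2', wintypes.WORD),
                ('lpReserved2', ctypes.POINTER(ctypes.c_byte)),
                ('hStdInput', wintypes.HANDLE),
                ('hStdOutput', wintypes.HANDLE),
                ('hStdError', wintypes.HANDLE)]

class PROCESS_INFORMATION(ctypes.Structure):
    _fields_ = [('hProcess', wintypes.HANDLE),
                ('hThread', wintypes.HANDLE),
                ('dwProcessId', wintypes.DWORD),
                ('dwThreadId', wintypes.DWORD)]

def run_unicode_cmdline(cmdline):
    # CreateProcessW may modify the command line in place, so pass a
    # mutable wide-character buffer rather than the unicode string itself.
    buf = ctypes.create_unicode_buffer(cmdline)
    si = STARTUPINFOW()
    si.cb = ctypes.sizeof(si)
    pi = PROCESS_INFORMATION()
    ok = kernel32.CreateProcessW(None, buf, None, None, False, 0,
                                 None, None, ctypes.byref(si), ctypes.byref(pi))
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    kernel32.WaitForSingleObject(pi.hProcess, 0xFFFFFFFF)  # INFINITE
    code = wintypes.DWORD()
    kernel32.GetExitCodeProcess(pi.hProcess, ctypes.byref(code))
    kernel32.CloseHandle(pi.hProcess)
    kernel32.CloseHandle(pi.hThread)
    return code.value

cmd = u'C:\\Python27\\python.exe C:\\path_to\\script.py -n "Testç on ã and ê"'
print run_unicode_cmdline(cmd)

The trade-off is that you sidestep the ANSI code page completely but have to re-implement any stdout/stderr plumbing that Popen would otherwise handle for you.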

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622