
I have an edge-case problem. My Python script script_A.py has this code (abbreviated).

script_A.py:
from __future__ import unicode_literals
import codecs
import os
import subprocess
import sys

executable = 'sample.exe'

kwargs = {}
kwargs['bufsize'] = 0
kwargs['executable'] = executable
kwargs['stdin'] = subprocess.PIPE
kwargs['stdout'] = subprocess.PIPE
kwargs['stderr'] = subprocess.PIPE
kwargs['preexec_fn'] = None
kwargs['close_fds'] = False
kwargs['shell'] = False
kwargs['cwd'] = None
kwargs['env'] = None
kwargs['universal_newlines'] = True
kwargs['startupinfo'] = None
kwargs['creationflags'] = 0
if sys.version_info.major == 3 and sys.version_info.minor > 5:
    kwargs['encoding'] = 'utf-8'

args = ['', '-x']

subproc = subprocess.Popen(args, **kwargs)

# service subproc.stdout and subproc.stderr on threads
stdout = _start_thread(_get_stdout, subproc)
stderr = _start_thread(_get_stderr, subproc)

with codecs.open('myutf-8.txt', encoding='utf-8') as fh:
    for line in fh:
        if os.name == 'nt':
            subproc.stdin.write(b'%s\n' % line.rstrip().encode('utf-8'))
        else:
            subproc.stdin.write('%s\n' % line.rstrip())  # OFFENDING LINE (see traceback below)

stdout.join()

This code works every time on Python 2.7.14 and 3.6.4, on Windows 8/10 and Ubuntu 16.04/17.10. Note that some of the kwargs values differ on Windows, but they are irrelevant here. It also works on Python 3.5.2 on Ubuntu 16.04, but only when I execute script_A.py from a GNOME terminal.

Sometimes, I need to use script_B.py to launch script_A.py instead of launching it from a terminal. script_B.py uses identical subprocess.Popen() code to launch the appropriate Python executable.

script_B.py:
import os

if os.name == 'nt':
    if use_py2:
        executable = 'C:\\Python27\\python.exe'
    else:
        executable = 'C:\\Program Files\\Python36\\python.exe'
else:
    if use_py2:
        executable = '/usr/bin/python'
    else:
        executable = '/usr/bin/python3'

args = ['', 'script_A.py']

# ---- ditto above code from here ----

I get this error when I execute script_A.py from script_B.py with Popen() on Python 3.5.2. None of the other combinations of OS/Python versions fail.

Traceback:
  File "script_A.py", line 30, in run
    subproc.stdin.write('%s\n' % line.rstrip())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

You can see that on 2.7.14 and 3.6.4, I use specific code to force the pipes to utf-8. I don't know how to set utf-8 encoding on 3.5.2.

So, is there a way to configure the encoding of Popen's pipes on 3.5.2? It might be easier to drop Python 3.5 from the supported versions, but I thought I'd ask here first.

tahoar

1 Answer


Your input file is UTF-8, and the program you are feeding data to expects UTF-8 input. So just send the raw binary directly, instead of decoding from bytes to text and then re-encoding from text to bytes.

Get rid of the line that turns on universal_newlines mode and the line that sets kwargs['encoding'], and replace the whole with block that feeds stdin with:

blinesep = os.linesep.encode('utf-8')  # Since you seem to need OS-specific line endings
with open('myutf-8.txt', 'rb') as fh:
    for line in fh:
        # rstrip() on bytes strips the trailing newline (and other ASCII whitespace)
        subproc.stdin.writelines((line.rstrip(), blinesep))

You can still handle the stdout/stderr streams as text streams if you like; you just explicitly wrap them in io.TextIOWrapper with the appropriate encoding. For example, you can wrap the binary stdout with:

import io

textout = io.TextIOWrapper(subproc.stdout, encoding='utf-8')
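
If you'd rather keep feeding text (str) to stdin on 3.5 instead of bytes, the same wrapping trick works in the other direction. This is only a sketch of the idea, assuming you've dropped universal_newlines as above so the pipe is binary (and write_through needs Python 3.3+), but it approximates what the encoding argument to Popen does for you on 3.6:

import io

# Sketch: wrap the binary stdin pipe so plain str writes are encoded to UTF-8.
# write_through=True pushes each write straight through to the pipe instead of
# holding it in the wrapper's own buffer.
textin = io.TextIOWrapper(subproc.stdin, encoding='utf-8', write_through=True)
textin.write(u'\u0645\u0631\u062d\u0628\u0627\n')  # e.g. Arabic text; reaches the child as UTF-8 bytes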

A couple side-notes:

  1. You're correct to explicitly set bufsize when calling Popen since it's impossible to behave consistently across Python versions without doing so; the default buffering behavior is unbuffered (bufsize=0) on Python 2 and Python 3.3.0 and earlier, and -1 (meaning "use decent default buffer size") in 3.3.1 and later. For performance, explicitly using bufsize=-1 is a good idea; you're threading the reads anyway, so buffering deadlocks aren't a concern.
  2. Never use codecs.open. It's buggy (it doesn't translate line-endings, mixing readline with read(n) calls does weird things, and when no encoding is passed it doesn't even wrap the result of plain open, so the API changes, etc.), slow, and quasi-deprecated. If you need consistent behavior across Python versions, use io.open, which provides the Python 3 built-in open function on Python 2.6 and higher; see the sketch below.
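
For illustration, here's a minimal sketch of that swap, reusing the file name from the question (the per-line handling is just a placeholder):

import io

# io.open on Python 2.6+ is the implementation that became Python 3's built-in
# open, so text mode decodes to unicode (Py2) / str (Py3) and translates
# newlines the same way on both.
with io.open('myutf-8.txt', encoding='utf-8') as fh:
    for line in fh:
        print(line.rstrip())  # placeholder; normalization/filtering would go here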
ShadowRanger
  • ShadowRanger, thank you so much for your detailed suggestion. I'll address the comments in groups. (1) universal_newlines is irrelevant; I'll drop it. Newlines pushed to .stdin are simple '\n' (fixed with .rstrip() above). (2) Thanks for the tip about bufsize = -1. I'll try it. – tahoar Mar 01 '18 at 14:00
  • (3) I'm aware of the Unicode LINE_SEPs and other quirks in codecs.open() and I manage them separately. This particular temp file is created after normalizing those quirks upstream. Still, I need the same behavior on all OS & Python systems from codecs. I'll look into io.open() to see if I can get the same consistency. – tahoar Mar 01 '18 at 14:01
  • I was skeptical about your encode() fix, but I tried it anyway. Besides the fact it didn't work (below), I need the lines as Unicode (Py3 str). This abbreviated version removed many functions that normalize/filter the Unicode strings (hence the need for unicode_literals). – tahoar Mar 01 '18 at 14:05
  • Note that the offending line `subproc.stdin.write('%s\n' % sline.rstrip())` sends a Unicode str to the subprocess. The error message reports that Popen() internally applies the `'ascii' codec` and fails with non-ascii characters. Reading and sending raw UTF-8 lines to stdin.write() throws the exact same error. This happens to be Arabic text, but could easily be any valid UTF-8 Unicode. – tahoar Mar 01 '18 at 14:11
  • On Python 3.6, the `kwargs['encoding']` configuration is absolutely critical because it configures Popen() to replace the 'ascii' codec with the 'UTF-8' codec. I need this same functionality on 3.5 if possible, but the 'encoding' keyword was added in 3.6. – tahoar Mar 01 '18 at 14:13
  • Note that I found this question: https://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python. It seemed to be just what I need. However, when I use that method and confirm that sys.stdout.encoding (as well as stderr.encoding and stdin.encoding) were changed to UTF-8 in script_A.py and script_B.py, I still get the 'ascii' encode error, but only with Python 3.5 and only when script_B.py launches script_A.py. I need a 3.5 equivalent to the encoding keyword, but alas, it looks like I'll just document the functional limit and move on. – tahoar Mar 01 '18 at 14:18
  • @tahoar: You didn't read everything I wrote. You don't need to pass `encoding` to `Popen` because you're feeding binary data to `stdin`, and rewrapping `stdout`/`stderr` in `io.TextIOWrapper` with an `encoding` specified there. Removing `universal_newlines` means the input & output are expected to be binary, which is why the rewrap is needed on the output. And `io.TextIOWrapper` always yields a Unicode supporting text type (`unicode` on Py2, `str` on Py3). My suggestions aren't piecemeal; reverting `Popen` to binary mode requires the other behaviors to ensure your desired encoding is used. – ShadowRanger Mar 01 '18 at 23:31
  • @tahoar: The only reason you'd have the error sending "raw" UTF-8 lines in my suggested way is if they weren't actually raw (e.g. you forgot to open the input file in binary mode, so you were sending it `unicode`, which it tried to encode to ASCII implicitly). – ShadowRanger Mar 01 '18 at 23:34
  • Thanks, ShadowRanger. Maybe I wasn't clear. * script_A.py code, data and text encoding work when executed with Py2.7, Py3.5 and Py3.6 from a Linux terminal. * script_A.py code, data and encoding work when executed with Py2.7 and Py3.6 from another Python script. * script_A.py code, data and encoding fail when executed with Py3.5 from another Python script. Changing `universal_newlines` failed. Using `open('txt', 'rb')` failed. I tried variations of io.TextIOWrapper(), but none of them worked. It's probable this could be the solution, but it's beyond me how to make it work. – tahoar Mar 03 '18 at 05:43