
Continuing from my previous question, I see that to get the error code of a process spawned via Popen in Python I have to call either wait() or communicate() (which can be used to access the Popen stdout and stderr attributes):

from subprocess import Popen, PIPE
import os
import re

app7z = '/path/to/7z.exe'
command = [app7z, 'a', dstFile.temp, "-y", "-r", os.path.join(src.Dir, '*')]
process = Popen(command, stdout=PIPE, startupinfo=startupinfo)
out = process.stdout
regCompressMatch = re.compile(r'Compressing\s+(.+)').match
regErrMatch = re.compile(r'Error: (.*)').match
errorLine = []
for line in out:
    if len(errorLine) or regErrMatch(line):
        errorLine.append(line)
    if regCompressMatch(line):
        pass  # update a progress bar
result = process.wait() # HERE
if result: # in the hopes that 7z returns 0 for correct execution
    dstFile.temp.remove()
    raise StateError(_("%s: Compression failed:\n%s") % (dstFile.s,
                       "\n".join(errorLine)))

However, the docs warn that wait() may deadlock (when stdout=PIPE, which is the case here) while communicate() might overflow memory. So:

  1. what is the proper thing to use here? Note that I do use the output.
  2. how exactly should I use communicate()? Would it be:

    process = Popen(command, stdout=PIPE, startupinfo=startupinfo)
    out = process.communicate()[0]
    # same as before...
    result = process.returncode
    if result: # ...
    

    I am not sure about blocking and the memory errors.

  3. Any better/more Pythonic way of handling the problem? I do not think that subprocess.CalledProcessError or subprocess.check_call/check_output apply in my case - or do they?

DISCLAIMER: I did not write the code, I am the current maintainer, hence question 3.

I am on Windows, if this makes a difference - Python 2.7.8.

There should be one-- and preferably only one --obvious way to do it

Mr_and_Mrs_D
  • unrelated: your line-handling code might be broken e.g., `regErrMatch(line)` is called only once. – jfs Jun 22 '15 at 16:42
  • @J.F.Sebastian: hey thanks - I think the intention was that 'once there is something in `errorLine` then add all remaining lines' (puzzled me too when I saw it) - probably once there is an error means all else will fail – Mr_and_Mrs_D Jun 22 '15 at 17:33

1 Answer

  • about the deadlock: it is safe to use stdout=PIPE and wait() together iff you read from the pipe. .communicate() does the reading and calls wait() for you.
  • about the memory: if the output can be unlimited then you should not use .communicate(), which accumulates all of the output in memory.
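
Both points can be seen in a minimal, self-contained sketch (the child command here is a hypothetical Python one-liner standing in for a real command):

```python
#!/usr/bin/env python
import sys
from subprocess import Popen, PIPE

# hypothetical child: a Python one-liner that prints a single short line
process = Popen([sys.executable, '-c', "print('done')"], stdout=PIPE)
out, _ = process.communicate()  # drains the pipe into memory, then waits
returncode = process.returncode
```

Because communicate() drains the pipe before waiting, the deadlock cannot happen; the trade-off is that `out` holds the entire output at once, which is only acceptable for bounded output.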

what is the proper thing to use here ?

To start the subprocess, read its output line by line, and wait for it to exit:

#!/usr/bin/env python
from subprocess import Popen, PIPE

process = Popen(command, stdout=PIPE, bufsize=1)
with process.stdout:
    for line in iter(process.stdout.readline, b''): 
        handle(line)
returncode = process.wait() 

This code does not deadlock, regardless of the finite OS pipe buffer size. It also supports commands with unlimited output (as long as an individual line fits in memory).

iter() is used to read a line as soon as the subprocess' stdout buffer is flushed, to work around the read-ahead bug in Python 2. If you don't need lines as soon as they are written (i.e., you can wait for the buffer to fill or for the child process to end), a simple `for line in process.stdout` suffices. See Python: read streaming input from subprocess.communicate().
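
The same pattern in a runnable, self-contained form (the child is a hypothetical Python one-liner standing in for `command`; `handle()` is replaced here by collecting the lines):

```python
#!/usr/bin/env python
import sys
from subprocess import Popen, PIPE

# hypothetical child standing in for `command`: emits two lines on stdout
child = "print('Compressing a.txt'); print('Compressing b.txt')"

process = Popen([sys.executable, '-c', child], stdout=PIPE)
seen = []
with process.stdout:
    for line in iter(process.stdout.readline, b''):
        seen.append(line.strip())  # handle each line as soon as it arrives
returncode = process.wait()
```

(`bufsize=1` is left out of this sketch; as discussed in the comments below, it mainly documents the intent to handle the stream line by line.)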

If you know that the command output can fit in memory in all cases then you could get the output all at once:

#!/usr/bin/env python
from subprocess import check_output

all_output = check_output(command)

It raises CalledProcessError if the command returns with a non-zero exit status. Internally, check_output() uses Popen() and .communicate().
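
For example (a sketch; the failing command is a hypothetical Python one-liner that deliberately exits with status 2):

```python
#!/usr/bin/env python
import sys
from subprocess import check_output, CalledProcessError

try:
    # hypothetical child that fails deliberately with exit status 2
    check_output([sys.executable, '-c', 'import sys; sys.exit(2)'])
    status = 0
except CalledProcessError as e:
    status = e.returncode  # the non-zero exit status is preserved here
```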

There should be one-- and preferably only one --obvious way to do it

subprocess.Popen() is the main API and works in many cases. There are convenience functions/methods such as Popen.communicate(), check_output(), and check_call() for common use-cases.

There are multiple methods and functions because there are multiple different use-cases.

jfs
  • Alright - the one thing I do not quite get is what `for line in iter(process.stdout.readline, b''):` buys us instead of `for line in process.stdout:` - I followed your link and I understand the syntax but still... Also, is closing the pipe necessary? (I guess yes) – Mr_and_Mrs_D Jun 22 '15 at 18:54
  • @Mr_and_Mrs_D: `with`-statement closes the pipe like with ordinary files. `iter()` is necessary to workaround [the read-ahead bug in Python 2](https://bugs.python.org/issue3907) -- it is not necessary if you don't need "real time" output. – jfs Jun 22 '15 at 18:59
  • Excellent, excellent thanks - I understand that `with` closes the pipe (:D) - I was not sure it's needed - please do add the info on the bug on the answer as this (as noted in the bug) is against the "one obvious way" and it is not obvious at all why a simple `for line in out` would not do. Thanks indeed – Mr_and_Mrs_D Jun 22 '15 at 19:03
  • I've misunderstood about the closing. Yes, it is necessary. File descriptors are a limited resource and you can run out of them (it usually doesn't matter for scripts, but you must pay attention in server code). – jfs Jun 22 '15 at 19:10
  • Note: Off hand profiling suggests that `bufsize=-1` gives me an improvement of 10 ms vs `bufsize=1` – Mr_and_Mrs_D Jun 30 '15 at 14:46
  • @Mr_and_Mrs_D: `bufsize=1` means "line-buffered". Otherwise, it should be equivalent to `bufsize=-1` (the buffer size is the same e.g., `BUFSIZ`; though CPython 2 uses `type = _IOLBF` instead of `type = _IOFBF` when it calls [`setvbuf()`](http://www.gnu.org/software/libc/manual/html_node/Controlling-Buffering.html)). Your result is probably a benchmark artifact. – jfs Jun 30 '15 at 15:21
  • Yes thanks, I know the difference I just don't have any rules of thumb regarding when to use one vs the other - but probably this is another question. Apparently line buffered (in my case) is (microscopically) less efficient for some reason (320 ms vs 330 ms) - will run more thorough tests and keep you posted, thanks again :) – Mr_and_Mrs_D Jun 30 '15 at 15:34
  • @Mr_and_Mrs_D: I use `bufsize=1` *to document my intent* to handle the stream line by line. Also, you could avoid an explicit `.flush()` call on Python 2 and the most recent Python 3 versions while sending a line to the child process. – jfs Jun 30 '15 at 15:52
  • I left it like this - exactly to document my intent to handle the stream line by line :) – Mr_and_Mrs_D Jun 30 '15 at 16:13
  • OK I'll bite... *"if the output can be unlimited then you should not use `.communicate()`"*. I'm using `.communicate()` for the first time and cannot think of a single command that generates infinite output to my 32 GB RAM. But let's say I redirect infinite output to my 1.5 TB of combined non-volatile storage, where do I redirect infinite output after that? – WinEunuuchs2Unix Sep 16 '20 at 02:48
  • @WinEunuuchs2Unix: the point is that you can consume the output on the fly (you don't need infinite space to consume infinite output) -- all you need is enough memory to store a single line that is passed to the `handle()` function. – jfs Sep 16 '20 at 18:15
  • @jfs My current function is to report results from `cp` (copy file) and `touch` (set file modification time) so it is usually 0 lines and occasionally 1 line. The `handle()` function might be useful for `ffplay` revamping where I want to pass control keys to `stdin` and read `stderr` for current settings in real time. I'll upvote your answer as it is good; I just question the justification of dealing with "unlimited" output, which the OOM killer would likely deal with anyway. – WinEunuuchs2Unix Sep 16 '20 at 23:31
  • On Python 3.10.1 I get _RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used_ – Brecht Machiels May 31 '23 at 11:42