41

I have some Python code that executes an external app which works fine when the app has a small amount of output, but hangs when there is a lot. My code looks like:

p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
errcode = p.wait()
retval = p.stdout.read()
errmess = p.stderr.read()
if errcode:
    log.error('cmd failed <%s>: %s' % (errcode,errmess))

There are comments in the docs that seem to indicate the potential issue. Under wait, there is:

Warning: This will deadlock if the child process generates enough output to a stdout or stderr pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.

though under communicate, I see:

Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

So it is unclear to me whether I should use either of these when I have a large amount of data, and the docs don't say what method I should use in that case.

I do need the return code from the command, and I parse and use both stdout and stderr.

So what method should I use in Python to run an external app that is going to produce a large amount of output?

SilentGhost
Tim
  • It seems "large" in the communicate documentation is *much larger* than you are likely expecting, and certainly much larger than common. For example, you can output 10MB of text and most systems would be fine with communicate. Output of 1GB when you only have 1GB of RAM would be another story. – Feb 26 '10 at 04:29

7 Answers

19

You're doing blocking reads to two files; the first needs to complete before the second starts. If the application writes a lot to stderr, and nothing to stdout, then your process will sit waiting for data on stdout that isn't coming, while the program you're running sits there waiting for the stuff it wrote to stderr to be read (which it never will be--since you're waiting for stdout).

There are a few ways you can fix this.

The simplest is not to intercept stderr at all: leave stderr=None, and the child's errors go straight to your process's own stderr. You can't capture them and display them as part of your own message; for command-line tools that's often OK, but for other apps it can be a problem.
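
A minimal sketch of that, reusing the question's cmd and log:

import subprocess

p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=None)
retval = p.stdout.read()    # only one pipe to drain, so this read can't deadlock
errcode = p.wait()
if errcode:
    log.error('cmd failed <%s>' % errcode)  # the error text already went to stderr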

Another simple approach is to redirect stderr to stdout, so you only have one incoming file: set stderr=STDOUT. This means you can't distinguish regular output from error output. This may or may not be acceptable, depending on how the application writes output.
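
A sketch of that variant, again reusing the question's cmd and log:

import subprocess

p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
combined = p.stdout.read()  # error output is interleaved with regular output
errcode = p.wait()
if errcode:
    log.error('cmd failed <%s>: %s' % (errcode, combined))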

The complete, and more complicated, way of handling this is select (http://docs.python.org/library/select.html). This lets you read without blocking on a single stream: you get data as it appears on either stdout or stderr. I'd only recommend it if it's really necessary; note that select on pipes does not work on Windows.

Victor Odouard
Glenn Maynard
  • Since the specific case I'm dealing with will have a lot of stdout and a small amount or no stderr, I'm going to try out the file redirection Mark suggested first, but the more complete coverage of the issue is very helpful. – Tim Jul 29 '09 at 19:12
10

Reading stdout and stderr independently, with very large output (i.e., many megabytes), using select:

import subprocess, select

proc = subprocess.Popen(cmd, bufsize=8192, shell=False,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

with open(outpath, "wb") as outf:
    dataend = False
    while (proc.returncode is None) or (not dataend):
        proc.poll()
        dataend = False

        ready = select.select([proc.stdout, proc.stderr], [], [], 1.0)

        if proc.stderr in ready[0]:
            data = proc.stderr.read(1024)
            if len(data) > 0:
                handle_stderr_data(data)

        if proc.stdout in ready[0]:
            data = proc.stdout.read(1024)
            if len(data) == 0: # Read of zero bytes means EOF
                dataend = True
            else:
                outf.write(data)
vz0
  • This by far makes the most sense to me in overcoming the in memory buffer issues. I even tried the subprocess `cmd` as `bash -c "cat /dev/urandom | tr -dc 'a-zA-Z0-9'"` which works great. My mental block were around what these lines mean - [1] `ready[0]` and why does [2] `len(proc.stdout.read(1024)) == 0` mean EOF? [3] Why not check the `len(proc.stderr.read(1024))`? [4] Why is read flush not needed? Sorry, several questions all lumped into one comment :/ – neowulf33 Nov 08 '17 at 02:12
  • @neowulf33 [1] ready is a list of lists, ready[0] is the list which can contain either stdout, stderr, or both. see select docs. [2] "An empty string is returned when EOF is encountered immediately." https://docs.python.org/2.7/library/stdtypes.html#file.read [3] because you'd lose the data! [4] I don't understand, how flush? – vz0 Nov 20 '17 at 13:44
  • Thanks! My bad - I was definitely half asleep when I wrote "read flush"! – neowulf33 Nov 20 '17 at 21:42
  • [5] why not read more than `1024` (1kb)? [6] how is `[proc.stdout, proc.stderr]` or `read[0]` related to `8192` (8kb)? Thanks! Doc links - [subprocess.Popen](https://docs.python.org/3/library/subprocess.html#subprocess.Popen) and [select.select](https://docs.python.org/3/library/select.html#select.select) – neowulf33 Nov 21 '17 at 18:49
  • @neowulf33 [5] [6] yes, probably the numbers I chose (1024, 8192) are kind of arbitrary, they are just large enough buffer sizes, AFAIK they don't have any special significance. – vz0 Nov 21 '17 at 22:03
6

"A lot of output" is subjective, so it's a little difficult to make a recommendation. If the amount of output is really large, then you likely don't want to grab it all with a single read() call anyway. You may want to try writing the output to a file and then pulling the data in incrementally, like so:

f = open('data.out', 'w')
p = subprocess.Popen(cmd, shell=True, stdout=f, stderr=subprocess.PIPE)
errcode = p.wait()
f.close()
if errcode:
    errmess = p.stderr.read()
    log.error('cmd failed <%s>: %s' % (errcode, errmess))
for line in open('data.out'):
    pass  # do something with each line
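
As the comments below note, if stderr could also produce a large amount of data, the same approach extends to it. A sketch, with 'data.err' as an illustrative filename:

out = open('data.out', 'w')
err = open('data.err', 'w')
p = subprocess.Popen(cmd, shell=True, stdout=out, stderr=err)
errcode = p.wait()
out.close()
err.close()
if errcode:
    log.error('cmd failed <%s>: %s' % (errcode, open('data.err').read()))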
Mark Roddy
  • This can also easily deadlock. If the forked process writes more data than the OS will buffer to stderr before exiting with an error code, this code will sit forever waiting for it to exit, while the process sits on a blocking write to stderr waiting for you to read it. – Glenn Maynard Jul 24 '09 at 23:24
  • 1) That assumes the large data output is on stderr, which would be odd but not unheard of; 2) if stderr is the source of the large amount of data, the solution is the same: make stderr a file as well. – Mark Roddy Jul 24 '09 at 23:31
  • In this instance, the process can potentially have a great deal of stdout, but will not have much, if any, stderr, so this is a reasonable solution for me. – Tim Jul 29 '09 at 19:08
6

Glenn Maynard is right in his comment about deadlocks. However, the best way of solving this problem is to create two threads, one for stdout and one for stderr, which read those respective streams until exhausted and do whatever you need with the output.
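
A minimal sketch of that two-thread approach (cmd is assumed to be the argument list for the child process; collecting the output into lists is just one option):

import subprocess, threading

def drain(stream, chunks):
    # Read until EOF so the child never blocks on a full pipe buffer.
    for chunk in iter(stream.readline, b''):
        chunks.append(chunk)
    stream.close()

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out_chunks, err_chunks = [], []
t_out = threading.Thread(target=drain, args=(proc.stdout, out_chunks))
t_err = threading.Thread(target=drain, args=(proc.stderr, err_chunks))
t_out.start(); t_err.start()
errcode = proc.wait()
t_out.join(); t_err.join()
stdout_data = b''.join(out_chunks)
stderr_data = b''.join(err_chunks)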

The suggestion of using temporary files may or may not work for you depending on the size of output etc. and whether you need to process the subprocess' output as it is generated.

As Heikki Toivonen has suggested, you should look at the communicate method. However, this buffers the stdout/stderr of the subprocess in memory, and you get those back from the communicate call; this is not ideal for some scenarios. But the source of the communicate method is worth looking at.

Another example is in a package I maintain, python-gnupg, where the gpg executable is spawned via subprocess to do the heavy lifting, and the Python wrapper spawns threads to read gpg's stdout and stderr and consume them as data is produced by gpg. You may be able to get some ideas by looking at the source there, as well. Data produced by gpg to both stdout and stderr can be quite large, in the general case.

Vinay Sajip
  • Will check out python-gnupg as an example. Thanks. – Tim Jul 29 '09 at 19:09
  • Relevant links to the interesting methods - [`_open_subprocess`](https://bitbucket.org/vinay.sajip/python-gnupg/src/952281d4c966608403a23af76429f11df9e0a852/gnupg.py?at=default&fileviewer=file-view-default#gnupg.py-825) and [`_collect_output`](https://bitbucket.org/vinay.sajip/python-gnupg/src/952281d4c966608403a23af76429f11df9e0a852/gnupg.py?at=default&fileviewer=file-view-default#gnupg.py-903) – neowulf33 Nov 21 '17 at 18:44
6

I had the same problem. If you have to handle a large amount of output, another good option is to use files for stdout and stderr, and pass those files as parameters.

Check the tempfile module in Python: https://docs.python.org/2/library/tempfile.html.

Something like this might work:

out = tempfile.NamedTemporaryFile(delete=False)

Then you would do:

Popen(... stdout=out,...)

Then you can read the file, and erase it later.
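
Putting it together, something along these lines might work (a sketch, reusing the question's cmd; the second temporary file for stderr is an addition of mine):

import os, subprocess, tempfile

out = tempfile.NamedTemporaryFile(delete=False)
err = tempfile.NamedTemporaryFile(delete=False)
p = subprocess.Popen(cmd, shell=True, stdout=out, stderr=err)
errcode = p.wait()
out.close()
err.close()

with open(out.name) as f:
    stdout_data = f.read()    # or iterate line by line for very large output
with open(err.name) as f:
    stderr_data = f.read()

os.unlink(out.name)           # erase the temporary files when done
os.unlink(err.name)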

Mariano Anaya
2

You could try communicate and see if that solves your problem. If not, I'd redirect the output to a temporary file.
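
For moderate amounts of output, that looks roughly like this (a sketch, reusing the question's cmd and log):

import subprocess

p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
retval, errmess = p.communicate()   # reads both pipes concurrently, then waits
if p.returncode:
    log.error('cmd failed <%s>: %s' % (p.returncode, errmess))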

Heikki Toivonen
  • Since communicate explicitly warns away from usage if you have a great deal of output, I'm going to look at the other options. – Tim Jul 29 '09 at 19:13
-1

Here is a simple approach that captures both regular output and error output, all within Python, so limitations on stdout don't apply:

com_str = 'uname -a'
command = subprocess.Popen(com_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print output

Linux 3.11.0-20-generic SMP Fri May 2 21:32:55 UTC 2014 

and

com_str = 'id'
command = subprocess.Popen(com_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print output

uid=1000(myname) gid=1000(mygrp) groups=1000(cell),0(root)
SDsolar