Python subprocess losing 10% of a program's stdout

Question

I have a program that needs to be called as a subprocess from Python. The program is written in Java. Yeah, I know...

Anyway, I need to capture all of the output from said program.

Unfortunately, when I call popen2.popen2 or subprocess.Popen with communicate()[0], I lose around 10% of the output data, both when I assign subprocess.PIPE to stdout and when I assign a file object (the return value of open) to stdout.

The documentation in subprocess is pretty explicit that using subprocess.PIPE is volatile if you're trying to capture all of the output from a child process.

I'm currently using pexpect to dump the output into a tmp file, but that's taking forever for obvious reasons.

I'd like to keep all the data in memory to avoid disk writes.

Any recommendations are welcome! Thanks!

import subprocess

cmd = 'java -Xmx2048m -cp "/home/usr/javalibs/class:/home/usr/javalibs/libs/dependency.jar" --data data --input input'

# doesn't get all the data
#
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
output = p.communicate()[0]

OR
# doesn't get all the data
#
fd = open("outputfile",'w')
p = subprocess.Popen(cmd, stdout=fd, shell=True)
p.communicate()
fd.close() # tried to use fd.flush() too.

# also tried
# p.wait() instead of p.communicate(), but wait doesn't really wait for the java program to finish running - it doesn't block

OR
# also fails to get all the data
#
import popen2
(rstdout, rstdin) = popen2.popen2(cmd)

Expected output is a series of ASCII lines (a couple thousand). Each line contains a number and an end-of-line character.

0\n
1\n
4\n
0\n
...

Is it possible some of the output is being written to stderr? — Jeremiah, May 21 '12 at 17:46
hey, just trying to capture stdout (not stderr). the output is a number and an end of line character - it's expecting all ascii output — ct_, May 21 '12 at 17:49
which "10%" are you missing? Is it at the beginning, the end? What output were you expecting? — Joel Cornett, May 21 '12 at 17:55
the last 10% of the output. posted an update to the question to clarify. thanks! — ct_, May 21 '12 at 18:00
"The documentation in subprocess is pretty explicit that using subprocess.PIPE is volatile if you're trying to capture all of the output from a child process." <-- if the documentation says this, it is entirely wrong. PIPE is perfectly safe and will get all output on the connected fd if properly used. — the paul, May 21 '12 at 18:02
@the paul not quite what i'm looking for as a response. if it's volatile where can i read about how to do it correctly. and when i use the fd i'm still not getting all the data - i suspect assigning an fd to stdout has the same issues as assigning subprocess.PIPE to stdout. — ct_, May 21 '12 at 18:03
"Note Do not use stdout=PIPE or stderr=PIPE with this function. As the pipes are not being read in the current process, the child process may block if it generates enough output to a pipe to fill up the OS pipe buffer." from [subprocess docs](http://docs.python.org/library/subprocess.html) — jadkik94, May 21 '12 at 18:04
"but wait doesn't really wait for the java program to finish running - it doesn't block" <-- also completely inaccurate. Are you sure your subprocess is working the way you expect it to? — the paul, May 21 '12 at 18:04
Right, subprocess.PIPE ought to be used with `communicate()` or otherwise with caution, to keep the input and output fds from blocking each other. That's partially what I meant by "properly used". — the paul, May 21 '12 at 18:05
@the paul. it's not reading all the output. i'm expecting ~1300+ lines of numbers with new line characters - depending on the inputs. and yes wait doesn't really "wait" my python script continues to execute past where i fork out the subprocess. as for accuracy, i'm explaining the problem i'm having. — ct_, May 21 '12 at 18:06
Are you sure your java subprocess isn't itself forking? That might explain why your `wait()` call appears not to be blocking. — the paul, May 21 '12 at 18:08
@jadkik94 & the paul so guys i appreciate your time but you're not really helping. i've already stated that PIPE has issues when calling subprocess (i've read the documentation several times) so how do you do this correctly? — ct_, May 21 '12 at 18:08
There's a warning in [communicate](http://docs.python.org/library/subprocess.html#subprocess.Popen.communicate) about large data, but it still is very unclear for an alternative... — jadkik94, May 21 '12 at 18:08
I assure you, PIPE does not have "issues" if you use `communicate` and your data fits in memory (if it didn't, you'd see a much more obvious failure). The notes in the docs are to keep people from trying to use it in an inappropriate way. I *would* like to help, but it seems like you really want to blame the wrong part of the system. — the paul, May 21 '12 at 18:10
To be more specific, using `subprocess.PIPE` or assigning an fd to the subprocess's output is essentially the exact same thing that your shell does when you do output redirection to a file (the OS's `dup2()` system call). You can safely assume that part is working. You might try adding "` | tee outputcopy`" at the end of your command there; then you could check that `outputcopy` has all the lines you expect. If it doesn't, maybe your java program isn't working quite right. — the paul, May 21 '12 at 18:19
See if that can help too: [another SO question](http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output) — jadkik94, May 21 '12 at 18:19
@jadkik94 that's very unlikely to be a problem here; "a couple thousand" lines of a few characters each would easily fit in memory, on any conceivable machine capable of running Python or Java at all. — the paul, May 21 '12 at 18:20

xbtsw · Answer 1 · 2012-05-21T18:42:03.480

I have used subprocess with much larger output on stdout and have not seen such a problem. It's hard to determine the root cause from what you've shown. I would check the following:

Since p.wait() didn't work for you, it could be that your java program is still busy printing the last 10% when you read the PIPE. Get p.wait() working properly first:

Insert a large enough wait (say 30 seconds) before you read the PIPE. Does the missing 10% show up?
It's doubtful that p.wait() doesn't block on your java program. Does your java program spawn another program as a subprocess of its own?
Check the return value of p.wait(). Did your java program terminate normally?
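The checks above can be sketched roughly as follows. This is a minimal sketch written for Python 3 (the thread itself uses Python 2). The helper name `run_and_report` and the stand-in child command are hypothetical, not the asker's actual java invocation. It captures stderr alongside stdout and reports the exit status, so an abnormal termination or output going to the wrong stream becomes visible.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run argv, wait for it to exit, and return (returncode, stdout, stderr)."""
    p = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads both pipes to EOF and then waits for the child,
    # so returncode is set by the time it returns.
    out, err = p.communicate()
    return p.returncode, out, err


if __name__ == "__main__":
    # Stand-in child (hypothetical): three lines on stdout, one on stderr,
    # then a non-zero exit status.
    child = [
        sys.executable, "-c",
        "import sys; print('0'); print('1'); print('4'); "
        "sys.stderr.write('oops\\n'); sys.exit(3)",
    ]
    code, out, err = run_and_report(child)
    print("exit status:", code)
    print("stdout lines:", len(out.splitlines()))
    print("stderr:", err.decode().strip())
```

A non-zero exit status or unexpected text on stderr would point at the child program rather than at subprocess.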

If the problem does not lie in your concurrency model, then check whether you are printing correctly in your java program:

What function do you use in your java program to print to stdout? Could it throw an IOException, or silently ignore one?
Did you flush the stream correctly? The last 10% could still be sitting in a buffer, unflushed, when your java program terminates.
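The flushing point can be illustrated with a Python stand-in for the child program. This is only a sketch of the same class of failure, not a claim about what the asker's java program does. It assumes Python 3, and the names `CHILD_NO_FLUSH`, `CHILD_FLUSH`, and `capture` are made up for the example. One child exits through `os._exit()`, which skips the usual flush of buffered stdout, and the other flushes explicitly first.

```python
import subprocess
import sys

# Child that writes 5000 short lines and then exits via os._exit(), which
# skips the normal flush of buffered stdout at interpreter shutdown.
CHILD_NO_FLUSH = (
    "import os, sys\n"
    "for i in range(5000):\n"
    "    sys.stdout.write('%d\\n' % i)\n"
    "os._exit(0)\n"
)

# Same child, but it flushes explicitly before exiting.
CHILD_FLUSH = CHILD_NO_FLUSH.replace(
    "os._exit(0)", "sys.stdout.flush()\nos._exit(0)"
)


def capture(source):
    """Run `source` in a fresh interpreter and return its stdout as bytes."""
    p = subprocess.Popen([sys.executable, "-c", source], stdout=subprocess.PIPE)
    out, _ = p.communicate()
    return out


if __name__ == "__main__":
    lossy = capture(CHILD_NO_FLUSH)
    full = capture(CHILD_FLUSH)
    print("without flush:", len(lossy.splitlines()), "lines")
    print("with flush:   ", len(full.splitlines()), "lines")
```

When stdout is a pipe, the unflushed child will typically come up short by whatever was still buffered at exit, which shows up as a missing tail of the output, while the parent's `communicate()` call is the same in both cases.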

will get right back with you - going to work on jdi's notes in a bit. thanks! — ct_, May 22 '12 at 17:18

score 2 · Answer 2 · answered May 21 '12 at 19:06

It must be something related to the process you are actually calling. You can verify this by doing a simple test with another Python script that echoes out lines:

out.py

import sys

for i in xrange(5000):
    print "%d\n" % i

sys.exit(0)

test.py

import subprocess

cmd = "python out.py"
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
output = p.communicate()[0]

print output

So you can verify that it's not the size of the data that is the issue, but rather the communication with the process you are calling.

You should also confirm the version of Python you are running, as I have read about past issues concerning the internal buffer of Popen (although using a separate file handle, as you have suggested, normally fixed that for me).

It would be a buffer issue if the subprocess call was hanging indefinitely. But if the process is completing, just lacking lines, then Popen is doing its job.
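To make that distinction concrete: the hanging case arises when the parent calls wait() on a child whose stdout is a PIPE that nobody is reading, so the child blocks once the OS pipe buffer fills. Draining the pipe before waiting avoids it. Here is a minimal Python 3 sketch with a stand-in child (the helper name `read_all_lines` is hypothetical):

```python
import subprocess
import sys


def read_all_lines(argv):
    """Read the child's stdout line by line until EOF, then reap the child."""
    p = subprocess.Popen(argv, stdout=subprocess.PIPE)
    lines = []
    # Iterating over p.stdout keeps draining the pipe, so the child never
    # blocks on a full pipe buffer; the loop ends at EOF.
    for raw in p.stdout:
        lines.append(raw.rstrip(b"\r\n"))
    p.stdout.close()
    returncode = p.wait()  # safe now: the pipe has been read to EOF
    return returncode, lines


if __name__ == "__main__":
    # Stand-in child that prints far more than a typical OS pipe buffer holds.
    child = [sys.executable, "-c", "for i in range(100000): print(i)"]
    code, lines = read_all_lines(child)
    print(code, len(lines), lines[-1])
```

If a loop like this returns with a zero exit status but fewer lines than expected, the missing lines were never written to the pipe by the child.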
