Using subprocess to launch hadoop job but can't get log from stdout

Question

To simplify my question, here's a Python script:

from subprocess import Popen, PIPE
proc = Popen(['./mr-task.sh'], shell=True, stdout=PIPE, stderr=PIPE)
while True:
    out = proc.stdout.readline()
    print(out)

Here's mr-task.sh, which starts a MapReduce job:

hadoop jar xxx.jar some-conf-we-don't-need-to-care

When I run ./mr-task.sh directly, I can see the log printed on the screen, something like:

14/12/25 14:56:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/25 14:56:44 INFO snappy.LoadSnappy: Snappy native library loaded
14/12/25 14:57:01 INFO mapred.JobClient: Running job: job_201411181108_16380
14/12/25 14:57:02 INFO mapred.JobClient:  map 0% reduce 0%
14/12/25 14:57:28 INFO mapred.JobClient:  map 100% reduce 0%

But I can't get this output when running the Python script. I tried removing shell=True and reading from stderr, but still got nothing.

Does anyone have any idea why this happens?

What do you get if you replace the line `out = proc.stdout.readline()` with `out = proc.stderr.readline()`? — falsetru, Dec 25 '14 at 07:37
@falsetru AAAh, stderr works. In fact I tried adding `proc.stderr.readline()` as I said in the question, but I didn't remove `proc.stdout.readline()`, so it got stuck there. Thank you! – laike9m, Dec 25 '14 at 07:41

score 3 · Answer 1 · edited May 23 '17 at 12:12

You could redirect stderr to stdout:

from subprocess import Popen, PIPE, STDOUT

proc = Popen(['./mr-task.sh'], stdout=PIPE, stderr=STDOUT, bufsize=1)
for line in iter(proc.stdout.readline, b''):
    print line,
proc.stdout.close()
proc.wait()

See Python: read streaming input from subprocess.communicate().
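The snippet above is Python 2 (`print line,`). A rough Python 3 equivalent might look like the sketch below. It assumes Python 3.7+ for `text=True`, and `CHILD` is a hypothetical stand-in for `./mr-task.sh` that logs to stderr the way the hadoop client does; `stream_merged` is an illustrative name, not part of the original answer.

```python
import subprocess
import sys

# Hypothetical stand-in for ./mr-task.sh: one line on stdout,
# then a log line on stderr (where the hadoop client writes its log).
CHILD = [sys.executable, "-c",
         "import sys; print('result'); sys.stdout.flush(); "
         "sys.stderr.write('INFO map 0% reduce 0%\\n')"]

def stream_merged(cmd):
    """Yield the child's stdout and stderr lines from a single pipe."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with block closes the pipe and waits for the child.

lines = list(stream_merged(CHILD))
for line in lines:
    print(line)
```

Because only one pipe exists, there is nothing left unread for the child to block on.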

in my real program I redirect stderr to stdout and read from stdout, so bufsize is not needed, is it?

The redirection of stderr to stdout and bufsize are unrelated. Changing bufsize might affect the time performance (on Python 2 the default is bufsize=0, i.e., unbuffered). Unbuffered I/O might be 10 to 100 times slower. As usual, you should measure the time performance if it is important.

Calling Popen.wait/communicate after the subprocess has terminated is just for clearing zombie process, and these two methods have no difference in such case, correct?

The difference is that proc.communicate() closes the pipes before reaping the child process. That releases file descriptors (a finite resource) so that they can be used for other files in your program.
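A small sketch of that difference, using a hypothetical child that prints one line and exits (the command and variable names are illustrative only):

```python
import subprocess
import sys

# Hypothetical child that prints one short line and exits.
cmd = [sys.executable, "-c", "print('done')"]

# wait(): reaps the child but leaves the parent's end of the pipe open.
p1 = subprocess.Popen(cmd, stdout=subprocess.PIPE)
p1.wait()  # safe here only because the output is tiny
open_after_wait = not p1.stdout.closed
p1.stdout.close()  # has to be closed by hand

# communicate(): reads to EOF, closes the pipe, then reaps the child.
p2 = subprocess.Popen(cmd, stdout=subprocess.PIPE)
out, _ = p2.communicate()
closed_after_communicate = p2.stdout.closed
```

With `wait()` alone the descriptor stays open until it is closed explicitly or the object is garbage collected.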

About the buffer: if the output fills the buffer, will the subprocess hang? Does that mean that with the default bufsize=0 setting I need to read from stdout as soon as possible so that the subprocess doesn't block?

No. It is a different buffer. bufsize controls the buffer inside the parent that is filled/drained when you call .readline() method. There won't be a deadlock whatever bufsize is.

The code (as written above) won't deadlock no matter how much output the child might produce.

The code in @falsetru's answer can deadlock because it creates two pipes (stdout=PIPE, stderr=PIPE) but it reads only from one pipe (proc.stderr).

There are several buffers between the child and the parent, e.g., C stdio's stdout buffer (a libc buffer inside the child process, inaccessible from the parent) and the child's stdout OS pipe buffer (inside the kernel; the parent process may read the data from here). These buffers have a fixed size; they won't grow if you put more data into them. If stdio's buffer overflows (e.g., during a printf() call) then the data is pushed downstream into the child's stdout OS pipe buffer. If nobody reads from the pipe then this OS pipe buffer fills up and the child blocks (e.g., on a write() system call) trying to flush the data.

To be concrete, I've assumed a C stdio-based program and a POSIX-like OS.

The deadlock happens because the parent tries to read from the stderr pipe that is empty because the child is busy trying to flush its stdout. Thus both processes hang.
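If both pipes really are needed, one way to avoid that hang is to drain them concurrently. The following is a minimal Python 3 sketch using threads; `CHILD` is a hypothetical child that writes about 200 KB to stdout (more than a typical pipe buffer) before logging to stderr, and `drain` is an illustrative helper name.

```python
import subprocess
import sys
import threading

# Hypothetical child: ~200 KB on stdout, then one log line on stderr.
CHILD = [sys.executable, "-c",
         "import sys; sys.stdout.write('x' * 200000); sys.stdout.flush(); "
         "sys.stderr.write('INFO done\\n')"]

def drain(pipe, sink):
    """Read a pipe to EOF so the child never blocks writing to it."""
    with pipe:  # closes the pipe once EOF is reached
        for chunk in iter(lambda: pipe.read(4096), b""):
            sink.append(chunk)

proc = subprocess.Popen(CHILD, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out_chunks, err_chunks = [], []
readers = [threading.Thread(target=drain, args=(proc.stdout, out_chunks)),
           threading.Thread(target=drain, args=(proc.stderr, err_chunks))]
for t in readers:
    t.start()
for t in readers:
    t.join()
proc.wait()  # reap the child after both pipes hit EOF

stdout_data = b"".join(out_chunks)
stderr_data = b"".join(err_chunks)
```

When the output does not need to be processed line by line as it arrives, `proc.communicate()` reads both pipes for you and is usually simpler.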

Thank you for the detailed suggestions! Some questions: 1. In my real program I redirect stderr to stdout and read from stdout, so `bufsize` is not needed, is it? 2. Calling `Popen.wait/communicate` after the subprocess has terminated is just for clearing the zombie process, and these two methods have no difference in such a case, correct? – laike9m, Dec 26 '14 at 03:42
1. The redirection and `bufsize` are unrelated. Changing `bufsize` may affect the time performance (the default `bufsize=0` i.e., unbuffered on Python 2). Also, [@JinghaoShi reports that it has other consequences](http://stackoverflow.com/questions/2715847/python-read-streaming-input-from-subprocess-communicate/17698359#comment39074982_17698359) -- though it might be a fluke. 2. `.communicate()` also closes the pipes (to avoid leaking file descriptors) — jfs, Dec 26 '14 at 03:48
Emmm, about the buffer: if the output fills the buffer, will the subprocess hang? Does that mean that with the default `bufsize=0` setting I need to read from stdout as soon as possible so that the subprocess doesn't block? – laike9m, Dec 26 '14 at 04:18
Upvoted your answer; it seems I need more knowledge of the OS to fully understand this topic. – laike9m, Dec 26 '14 at 06:07

falsetru · Accepted Answer · 2014-12-26T01:04:57.763

0

One possible reason is that the output is printed to standard error instead of standard output.

Try replacing stdout with stderr:

from subprocess import Popen, PIPE
proc = Popen(['./mr-task.sh'], stdout=PIPE, stderr=PIPE)
while True:
    out = proc.stderr.readline()  # <----
    if not out:
        break
    print(out)
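As the comments below point out, leaving `stdout=PIPE` unread can hang the script once the child writes enough to stdout. If only the stderr log is wanted, one hedged variant (Python 3.3+ for `subprocess.DEVNULL`) is to discard stdout instead of piping it. `CHILD` is a hypothetical stand-in for `./mr-task.sh` that floods stdout before logging to stderr.

```python
import subprocess
import sys

# Hypothetical child: ~200 KB on stdout, then a log line on stderr.
CHILD = [sys.executable, "-c",
         "import sys; sys.stdout.write('x' * 200000); sys.stdout.flush(); "
         "sys.stderr.write('INFO map 100% reduce 0%\\n')"]

# stdout goes to the null device, so no unread pipe can fill up.
proc = subprocess.Popen(CHILD, stdout=subprocess.DEVNULL,
                        stderr=subprocess.PIPE)
log_lines = []
for raw in proc.stderr:          # iterate until EOF on stderr
    log_lines.append(raw.decode().rstrip())
proc.stderr.close()
returncode = proc.wait()         # reap the child to avoid a zombie

for line in log_lines:
    print(line)
```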

edited Dec 26 '14 at 01:04

answered Dec 25 '14 at 07:44


I think `proc.wait()` is unnecessary. – laike9m Dec 25 '14 at 07:48
@laike9m: `proc.wait()` is necessary to avoid zombies. Also, if the child process generates enough output on stdout to fill its OS pipe buffers (around 64KB on my box) then your script **deadlocks** (small tests might run fine but it may hang in production -- very bad). Do *not* use `stream=PIPE` unless you read from the stream! You could redirect `stderr=STDOUT` and read from `proc.stdout` instead. Otherwise you need to read from *both* stdout and stderr concurrently (using [threads](http://stackoverflow.com/a/25755038/4279), [asyncio](http://stackoverflow.com/a/25960956/4279)) – jfs Dec 26 '14 at 00:05
[do not use a list argument together with `shell=True`](http://bugs.python.org/issue21347). Also, either use `print out,` or `sys.stdout.buffer.write(out)` depending on Python version to avoid doubling all newlines. – jfs Dec 26 '14 at 00:07
@J.F.Sebastian, Thank you for the nice explanation. I just removed `shell=True`. Except that, I will leave the code as is, otherwise my code will be exactly same as yours. – falsetru Dec 26 '14 at 01:08
@J.F.Sebastian I didn't know `shell=True` will use 2nd and following arguments for shell itself, thx for pointing this out! – laike9m Dec 26 '14 at 03:32
