Streaming wrapper around program that writes to multiple output files

Question

There is a program (which I cannot modify) that creates two output files. I am trying to write a Python wrapper that invokes this program, reads both output streams simultaneously, combines the output, and prints to stdout (to facilitate streaming). How can I do this without deadlocking? The proof of concept below works fine, but when I apply this approach to the actual program it deadlocks.

Proof of concept: this is a dummy program, bogus.py, that creates two output files like the program I'm trying to wrap.

#!/usr/bin/env python
from __future__ import print_function
import sys
with open(sys.argv[1], 'w') as f1, open(sys.argv[2], 'w') as f2:
    for i in range(1000):
        if i % 2 == 0:
            print(i, file=f1)
        else:
            print(i, file=f2)

And here is the Python wrapper that invokes the program and combines its two outputs (interleaving 4 lines from each at a time).

#!/usr/bin/env python
from __future__ import print_function
from contextlib import contextmanager
import os
import shutil
import subprocess
import tempfile

@contextmanager
def named_pipe():
    """
    Create a temporary named pipe.

    Stolen shamelessly from StackOverflow:
    http://stackoverflow.com/a/28840955/459780
    """
    dirname = tempfile.mkdtemp()
    try:
        path = os.path.join(dirname, 'named_pipe')
        os.mkfifo(path)
        yield path
    finally:
        shutil.rmtree(dirname)

with named_pipe() as f1, named_pipe() as f2:
    cmd = ['./bogus.py', f1, f2]
    child = subprocess.Popen(cmd)
    with open(f1, 'r') as in1, open(f2, 'r') as in2:
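        buff = list()
        for line1, line2 in zip(in1, in2):
            print(line1.strip())
            buff.append(line2.strip())
            if len(buff) == 4:
                for line in buff:
                    print(line)
                buff = list()  # reset so the next group of 4 is collected
        for line in buff:  # flush any leftover lines from the second file
            print(line)
    child.wait()  # reap the child once both pipes are exhausted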

Have you tried the obvious: `subprocess.check_call(['/program', '-', '-'])`? If the `program` doesn't understand `'-'`, pass `'/dev/stdout'` or `'/dev/fd/1'`, or use a single named pipe instead of `'-'`, depending on the system. Beware: if the child process writes to the files concurrently from separate threads, the output may interleave in the middle of a line. — jfs, Jun 07 '16 at 16:08
Yeah, the order of the output is problematic when `/dev/stdout` is used as the output file name. They really need to be treated as separate streams. — Daniel Standage, Jun 07 '16 at 16:34
What do you mean by "the order" here? Could you provide an example? 1- The write order to different files is undefined: there is no before or after from outside the child process. E.g., if the child writes `write(fd1, 'a'); write(fd2, 'b')` and you are reading from the files corresponding to `fd1` and `fd2`, there is no way for you to know which comes first, `'a'` or `'b'`. 2- The order you are talking about might be a manifestation of the internal buffering in the child process... — jfs, Jun 07 '16 at 17:02
..[continued] Perhaps the issue is the stdout buffering. See whether passing `/dev/stderr` or `/dev/tty` helps convince the child to line-buffer its output to the files. If you can't control the child's internal buffering, then you might see a big chunk from one file, then a big chunk from another file, etc. — jfs, Jun 07 '16 at 17:03
I'm seeing big chunks of one file and then big chunks of the other file, regardless of whether I write to stdout, stderr, or tty. — Daniel Standage, Jun 07 '16 at 17:30

score 3 · Accepted Answer · edited May 23 '17 at 10:32

"I'm seeing big chunks of one file and then big chunks of the other file, regardless of whether I write to stdout, stderr, or tty."

If you can't make the child use line buffering for its files, a simple way to read complete, interleaved lines from the output files as soon as output becomes available, while the process is still running, is to use threads:

#!/usr/bin/env python2
from subprocess import Popen
from threading import Thread
from Queue import Queue

def readlines(path, queue):
    try:
        with open(path) as pipe:
            for line in iter(pipe.readline, ''):
                queue.put(line)
    finally:
        queue.put(None)

with named_pipes(n=2) as paths:
    child = Popen(['python', 'child.py'] + paths)
    queue = Queue()
    for path in paths:
        Thread(target=readlines, args=[path, queue]).start()
    for _ in paths:
        for line in iter(queue.get, None):
            print line.rstrip('\n')

where named_pipes(n) is defined here.
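
The linked definition is not reproduced in this thread. As a rough guess at what such a helper could look like, here is a minimal sketch that generalizes the named_pipe() context manager from the question to n pipes. The file naming scheme and the choice to yield a list are assumptions made for illustration (a list is needed because the code above concatenates paths onto the command line).

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def named_pipes(n=1):
    """Create n temporary FIFOs and yield their paths as a list.

    Hypothetical stand-in for the helper referenced in the answer.
    """
    dirname = tempfile.mkdtemp()
    try:
        paths = [os.path.join(dirname, 'named_pipe' + str(i)) for i in range(n)]
        for path in paths:
            os.mkfifo(path)  # POSIX only
        yield paths
    finally:
        shutil.rmtree(dirname)  # removes the FIFOs along with the directory
```

The pipes disappear when the with block exits, so the child should be finished writing (or be waited on) before leaving the block.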

pipe.readline() is broken for a non-blocking pipe on Python 2; that is why threads are used here.
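
The listing above targets Python 2 (print statement, Queue module). For readers on Python 3, a rough port of the same idea might look like the sketch below. The generator name merged_lines and the use of daemon threads are choices made here, not part of the original answer; the paths can be anything open() can read, including named pipes.

```python
import queue
import threading


def readlines(path, q):
    """Push every line read from path onto q, then None as an EOF marker."""
    try:
        with open(path) as pipe:
            for line in iter(pipe.readline, ''):
                q.put(line)
    finally:
        q.put(None)


def merged_lines(paths):
    """Yield complete lines from all paths, in whatever order they arrive."""
    q = queue.Queue()
    for path in paths:
        threading.Thread(target=readlines, args=(path, q), daemon=True).start()
    for _ in paths:  # expect one None sentinel per reader thread
        for line in iter(q.get, None):
            yield line.rstrip('\n')
```

Usage would be along the lines of `for line in merged_lines(paths): print(line)` after starting the child with Popen, followed by child.wait().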

To print a line from one file followed by a line from another:

with named_pipes(n=2) as paths:
    child = Popen(['python', 'child.py'] + paths)
    queues = [Queue() for _ in paths]
    for path, queue in zip(paths, queues):
        Thread(target=readlines, args=[path, queue]).start()
    while queues:
        for q in list(queues):  # iterate over a copy; queues is mutated below
            line = q.get()
            if line is None:  # EOF
                queues.remove(q)
            else:
                print line.rstrip('\n')

If child.py writes more lines to one file than to the other, the difference is kept in memory, so individual queues in queues may grow without bound until they fill all the memory. You can set the maximum number of items in a queue, but then you have to pass a timeout to q.get(); otherwise the code may deadlock.
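
To make the bounded-queue variant concrete, here is a minimal Python 3 sketch. It assumes each queue was created as queue.Queue(maxsize=...) and is fed by a reader thread that puts None at EOF. The function name round_robin and the default timeout are made up for illustration. When a queue is full its reader thread blocks in put(), which eventually stalls the child on that pipe; the timeout lets the consumer give up on an empty queue and go drain the full one.

```python
import queue


def round_robin(queues, timeout=0.1):
    """Yield lines from bounded queues in turn; None marks EOF on a queue.

    A queue that has nothing ready within `timeout` seconds is skipped for
    this round, so strict alternation is relaxed rather than deadlocking.
    """
    active = list(queues)
    while active:
        for q in list(active):  # copy: active is mutated below
            try:
                line = q.get(timeout=timeout)
            except queue.Empty:
                continue  # nothing yet; give the other queues a chance
            if line is None:  # EOF for this queue
                active.remove(q)
            else:
                yield line
```

The trade-off is that output order is no longer strictly one line per file whenever one file falls behind, which is the price of capping memory use.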

If you need to print exactly 4 lines from one output file, then exactly 4 lines from another output file, and so on, you could slightly modify the given code example:

    while queues:
        # print 4 lines from one queue followed by 4 lines from another queue
        for q in list(queues):  # iterate over a copy; queues is mutated below
            for _ in range(4):
                line = q.get()
                if line is None:  # EOF
                    queues.remove(q)
                    break
                else:
                    print line.rstrip('\n')

It won't deadlock, but it may exhaust memory if the child process writes too much data into one file without writing enough into the other. Only the difference is kept in memory, so if the files grow at roughly the same rate, the program supports arbitrarily large output files.
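
For Python 3, the grouping logic can be pulled out into a small generator, which also makes it easy to try out without a child process. This is only a sketch: the name interleave and the parameter n are introduced here, and the queues are assumed to be unbounded and fed by reader threads that put None at EOF, as in the answer.

```python
import queue


def interleave(queues, n=4):
    """Yield up to n lines from each queue in turn; None marks EOF on a queue."""
    active = list(queues)
    while active:
        for q in list(active):  # copy: active is mutated below
            for _ in range(n):
                line = q.get()  # blocks until a reader thread delivers
                if line is None:  # EOF for this queue
                    active.remove(q)
                    break
                yield line.rstrip('\n')
```

Once one file hits EOF, the remaining file is simply drained in groups of n.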

Awesome, this solved the deadlock issue. The order of the output is still an issue, though. It needs to be 4 lines from the first file, and then 4 lines from the next file, and so on. I'm still getting large chunks at a time from each file. — Daniel Standage, Jun 07 '16 at 19:45
Thought: replace `for path in paths:` with `for i, path in enumerate(paths):`, and then pass `i` as an argument to readlines. Would that allow me to accumulate the data into two separate queues and pop off 4 lines from each at a time? — Daniel Standage, Jun 07 '16 at 19:49
This is all complicated by the fact that this will often process very large amounts of data, so we can't just accumulate it all into memory and then deal with it at the end. — Daniel Standage, Jun 07 '16 at 19:50
@DanielStandage if the files are never too much out of sync then it is straightforward to modify the code to interleave each line (one line from the first file, the next line from another file, etc.). However, if the files are allowed to get out of sync (e.g., if the first is 10 times larger than the other), then the modified code will deadlock for sufficiently large files, while the code in my answer works. — jfs, Jun 07 '16 at 20:02

score 1 · Answer 2 · answered Jun 07 '16 at 07:18

Popen only spawns the process. You have to do something like child.communicate() to actually interact with it and obtain its output.

Also, I think you'll need to open the pipes for reading before starting the process.
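
For reference, here is a small Python 3 sketch of what communicate() does give you, assuming the child writes to its standard output. The helper name run_and_capture is made up for this example. Note that it only sees the child's stdout/stderr pipes; data the child writes to other files (the situation in the question) never passes through it.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd, wait for it to exit, and return (returncode, stdout bytes)."""
    child = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    out, _ = child.communicate()  # reads stdout to EOF, then waits for exit
    return child.returncode, out
```

For example, `run_and_capture([sys.executable, '-c', 'print("hello")'])` returns the exit status together with the bytes the child printed.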

tripleee


If I understand correctly, communicate() will only return the stdin/stdout. I want to intercept the data the program writes to two output files. And on my proof-of-concept, opening the pipes for reading before calling Popen causes a deadlock. – Daniel Standage Jun 07 '16 at 14:48
