grep library output from within Python

Question

When calling a program from the command line, I can pipe the output to grep to select the lines I want to see, e.g.

printf "hello\ngood day\nfarewell\n" | grep day

I am looking for the same kind of line selection, but for a C library called from Python. Consider the following example:

import os

# Function which emulates a C library call
def call_library():
    os.system('printf "hello\ngood day\nfarewell\n"')

# Pure Python stuff
print('hello from Python')
# C library stuff
call_library()

When running this Python code, I want the output of the C part to be grep'ed for the string 'day', making the output of the code

hello from Python
good day

So far I have fiddled around with redirection of stdout, using the methods described here and here. I am able to make the C output vanish completely, or save it to a str and print it out later (which is what the two links are mainly concerned with). I am not, however, able to select which lines get printed based on their content. Importantly, I want the output in real time while the C library is being called, so I cannot just redirect stdout to some buffer and process that buffer after the fact.

The solution only needs to work with Python 3.x on Linux. If, in addition to line selection, the solution also makes line editing possible, that would be even better.

I think the following should be possible, but I do not know how to set it up

1. Redirect stdout to a "file" in memory.
2. Spawn a new thread which constantly reads from this file, does the selection based on line content, and writes the wanted lines to the screen, i.e. the original destination of stdout.
3. Call the C library.
4. Join the two threads back together and redirect stdout back to its original destination (the screen).

I do not have a firm enough grasp of file descriptors and the like to be able to do this, nor to even know if this is the best way of doing it.

Edit

Note that the solution cannot simply re-implement the code in call_library. The code must call call_library, totally agnostic to the actual code which then gets executed.

Is there a reason you cannot capture all the output into a Python string and then `split()` it and extract the matching lines? (Massive amounts of output would be a dealbreaker for this scenario, obviously.) — tripleee, Nov 27 '17 at 18:39
Well, exactly how would you go about it? Remember that I want the line selection to be done from within the Python session that calls the library. Piping all of the output to some new Python session that does the `grep`'ing is not an option. If you think you have a solution, please do share. — jmd_dk, Nov 27 '17 at 18:55
The Stack Overflow question you link to has a `StringIO` wrapper for capturing stdout into a variable. — tripleee, Nov 27 '17 at 19:05
Yes, and I can indeed get that to work. But how can I take this further and achieve what I want? — jmd_dk, Nov 28 '17 at 14:17
Arguably, the library is broken if it doesn't allow you to capture its results to a memory buffer. Incidentally, I came across this vaguely related question: https://stackoverflow.com/questions/47381835/scipy-minimize-get-cost-function-vs-iteration — tripleee, Dec 08 '17 at 06:03

score 5 · Accepted Answer · answered Nov 30 '17 at 18:03

I'm a little confused about exactly what your program is doing, but it sounds like you have a C library that writes to the C stdout (not the Python sys.stdout) and you want to capture this output and postprocess it, and you already have a Python binding for the C library, which you would prefer to use rather than a separate C program.
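The difference between the process-level stdout (file descriptor 1) and Python's sys.stdout can be seen in a small sketch. Here os.write(1, ...) stands in for a C library writing to stdout, and the function name capture_python_level_only is made up for illustration:

```python
import contextlib
import io
import os


def capture_python_level_only():
    """Return what redirect_stdout captures while both kinds of write happen."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Goes through sys.stdout, so it lands in the StringIO buffer.
        print("written via sys.stdout")
        # Bypasses sys.stdout entirely, as a C library's output would.
        os.write(1, b"written straight to fd 1\n")
    return buffer.getvalue()


if __name__ == "__main__":
    print(repr(capture_python_level_only()))
```

Only the print() line ends up in the buffer; the os.write() line still reaches the terminal. That is why swapping sys.stdout alone cannot filter the library's output.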

First off, you must use a child process to do this; nothing else will work reliably. This is because stdout is process-global, so there's no reliable way to capture only one thread's writes to stdout.

Second off, you can use subprocess.Popen, because you can re-invoke the current script using it! This is what the Python multiprocessing module does under the hood, and it's not terribly hard to do yourself. I would use a special, hidden command line argument to distinguish the child, like this:

import argparse
import subprocess
import sys

def subprocess_call_c_lib():
    import c_lib
    c_lib.do_stuff()

def invoke_c_lib():
    proc = subprocess.Popen([sys.executable, __file__,
                             "--internal-subprocess-call-c-lib"
                             # , ...
                             ],
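                            stdin=subprocess.DEVNULL,
                            stdout=subprocess.PIPE,
                            universal_newlines=True)  # yield str lines, not bytes
    for line in proc.stdout:
        # filter output from the library here
        # to display to "screen", write to sys.stdout as usual
        if "day" in line:  # example filter; replace with your own test
            sys.stdout.write(line)
            sys.stdout.flush()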

    if proc.wait():
        raise subprocess.CalledProcessError(proc.returncode, "c_lib")

def main():
    ap = argparse.ArgumentParser()  # pass prog/description here as needed
    ap.add_argument("--internal-subprocess-call-c-lib", action="store_true",
                    help=argparse.SUPPRESS)
    # ... more arguments ...

    args = ap.parse_args()
    if args.internal_subprocess_call_c_lib:
        subprocess_call_c_lib()
        sys.exit(0)

    # otherwise, proceed as before ...

main()

This works. However, I need the object returned by `c_lib.do_stuff()` (which may be several GB in memory) to be available to the original process. Can this be achieved? — jmd_dk, Dec 01 '17 at 15:04
That's a major additional complication, and you might need to ask a new question just about that. I would normally suggest serializing it using `pickle` or a relative, but if it's several GB that's not going to be efficient. So the first thing I would look at is moving whatever work needs to be done on that object into the child process, if at all possible. The second thing I would look at is somehow allocating that object in a shared memory segment. — zwol, Dec 01 '17 at 18:06
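For a result of modest size, the pickle route mentioned above could look roughly like this. It is only a sketch: CHILD_CODE is a hypothetical stand-in for the re-invoked script, the dict stands in for the object returned by `c_lib.do_stuff()`, and run_child is a made-up name. For several GB this copy through a file is likely too slow, as noted.

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Hypothetical child program: the print calls stand in for library output,
# and the dict stands in for the object the library returns.
CHILD_CODE = """
import pickle, sys
print("hello")
print("good day")
result = {"answer": 42}
with open(sys.argv[1], "wb") as f:
    pickle.dump(result, f)
"""


def run_child():
    """Run the child, filter its stdout, and load the object it pickled."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "result.pkl")
        proc = subprocess.Popen([sys.executable, "-c", CHILD_CODE, path],
                                stdin=subprocess.DEVNULL,
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        # Keep only the lines of interest while the child is running.
        kept = [line for line in proc.stdout if "day" in line]
        if proc.wait():
            raise subprocess.CalledProcessError(proc.returncode, "child")
        # The child has exited, so the pickle file is complete.
        with open(path, "rb") as f:
            return kept, pickle.load(f)


if __name__ == "__main__":
    lines, result = run_child()
    sys.stdout.writelines(lines)
    print(result)
```

The result travels through a separate file so that it never mixes with the text on the child's stdout.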

score 0 · Answer 2 · answered Nov 29 '17 at 20:16

It is possible if the grepping thread prints to stderr, at least:

import io
import os
import sys
import threading
import time

# Function which emulates a C library call
def call_library():
    os.system("echo hello")
    time.sleep(1.0)
    os.system("echo good day")
    time.sleep(1.0)
    os.system("echo farewell")
    time.sleep(1.0)
    os.system("echo done")


class GrepThread(threading.Thread):
    def __init__(self, r,):
        threading.Thread.__init__(self)
        self.r = r

    def run(self):
        while True:
            s = self.r.readline()
            if not s:
                break
            if "day" in s:
                # s already ends with a newline, so suppress print's own
                print(s, end="", file=sys.stderr)

original_stdout_fd = sys.stdout.fileno()
# file descriptors r, w for reading and writing
r, w = os.pipe() 
r = os.fdopen(r)
os.dup2(w, original_stdout_fd)
sys.stdout = io.TextIOWrapper(os.fdopen(original_stdout_fd, 'wb'))

thread = GrepThread(r)
thread.start()
print("Starting", file=sys.stderr)
call_library()

Note that this does not close the thread nor clean things up, but it seems to work on my computer. It will print the lines as the function executes, not afterwards.
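A variant that writes the kept lines to the real stdout and shuts the reader thread down afterwards might look like the sketch below. The names call_filtered, transform and out_fd are made up for illustration. It assumes the library call releases the GIL while it writes (or writes less than the pipe buffer holds), since otherwise the reader thread cannot drain the pipe and the call may block. It also assumes the library flushes its own C stdio buffer.

```python
import os
import sys
import threading


def call_filtered(func, transform, out_fd=None):
    """Call func() while passing each line written to fd 1 through transform.

    transform receives a line as bytes and returns the bytes to emit, or
    None to drop the line. Kept lines go to out_fd (default: the real stdout).
    """
    sys.stdout.flush()                # empty Python's buffer before the switch
    saved_fd = os.dup(1)              # keep a handle on the real stdout
    target_fd = saved_fd if out_fd is None else out_fd
    read_fd, write_fd = os.pipe()

    def pump():
        # Iterating the binary file yields each line once it is complete.
        with os.fdopen(read_fd, "rb") as reader:
            for line in reader:
                edited = transform(line)
                if edited is not None:
                    os.write(target_fd, edited)

    reader_thread = threading.Thread(target=pump)
    reader_thread.start()
    os.dup2(write_fd, 1)              # fd 1 now feeds the pipe
    os.close(write_fd)
    try:
        return func()
    finally:
        sys.stdout.flush()
        os.dup2(saved_fd, 1)          # restore stdout; closes the last write end
        reader_thread.join()          # pump() sees EOF and returns
        os.close(saved_fd)


if __name__ == "__main__":
    print("hello from Python")
    call_filtered(lambda: os.system('printf "hello\\ngood day\\nfarewell\\n"'),
                  lambda line: line if b"day" in line else None)
```

Because transform returns the bytes to print, it can edit lines as well as select them, e.g. `lambda line: line.upper() if b"day" in line else None`.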

I would like the output to be printed to `stdout` and not `stderr`, as I treat the two streams differently. Also, the program seems to hang, probably because of the non-closed thread. Could you fix these issues? — jmd_dk, Nov 29 '17 at 21:21
For threading in Python, just search for that here on SO. And no, I don’t know how to print it to stdout, unfortunately. — Petter, Nov 30 '17 at 14:37
