
The code below is a toy example of the actual situation I am dealing with¹. (Warning: this code will loop forever.)

import subprocess
import uuid

class CountingWriter:
    def __init__(self, filepath):
        self.file = open(filepath, mode='wb')
        self.counter = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()

    def __getattr__(self, attr):
        return getattr(self.file, attr)

    def write(self, data):
        written = self.file.write(data)
        self.counter += written
        return written

with CountingWriter('myoutput') as writer:
    with subprocess.Popen(['/bin/gzip', '--stdout'],
                          stdin=subprocess.PIPE,
                          stdout=writer) as gzipper:
        while writer.counter < 10000:
            gzipper.stdin.write(str(uuid.uuid4()).encode())
            gzipper.stdin.flush()
            writer.flush()
            # writer.counter remains unchanged

        gzipper.stdin.close()

In English: I start a subprocess, called gzipper, which receives input through its stdin and writes compressed output to a CountingWriter object. The code features a while-loop, conditioned on the value of writer.counter, that feeds some random content to gzipper at each iteration.

This code does not work!

More specifically, writer.counter never gets updated, so execution never leaves the while-loop.

This example is certainly artificial, but it captures the problem I would like to solve: how to terminate the feeding of data into gzipper once it has written a certain number of bytes.

Q: How must I change the code above to get this to work?


FWIW, I thought that the problem had to do with buffering, hence all the calls to *.flush() in the code. They have no noticeable effect, though. Incidentally, I cannot call gzipper.stdout.flush() because gzipper.stdout is not a CountingWriter object (as I had expected), but rather it is None, surprisingly enough.
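
Here is a minimal check of that last point (a sketch, using /bin/cat and a throwaway output file purely for illustration): Popen.stdout is only a pipe object when stdout=subprocess.PIPE is requested; with any other target it is None.

import subprocess

# sketch: /bin/cat and the 'scratch' file are stand-ins, just to show that
# Popen.stdout is None whenever stdout is redirected to an existing file
with open('scratch', 'wb') as out:
    with subprocess.Popen(['/bin/cat'], stdin=subprocess.PIPE, stdout=out) as p:
        print(p.stdout)        # None
        p.stdin.close()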


¹ In particular, I am using a /bin/gzip --stdout subprocess only for the sake of this example, because it is a more readily available alternative to the compression program that I am actually working with. If I really wanted to gzip-compress my output, I would use Python's standard gzip module.

kjo
  • Try e.g. `bufsize=1024` for your `Popen`... – AKX May 18 '23 at 13:44
  • I don't think there is a simple way of doing this. The object that's actually passed to the new process is a fileno. You can try creating an actual pipe (if you are on Linux) and have your parent process listen to it while the child writes to it, or have the child write to a regular file and have the parent "tail" it; an example can be seen [in this answer](https://stackoverflow.com/a/12523371/7540911) – Nullman May 18 '23 at 14:15
  • yes - creating an intermediate PIPE would be a way of doing it - but if the final destination is a regular file, it is much simpler to simply `stat` it whenever one wants to know its size. – jsbueno May 18 '23 at 14:27

1 Answer


Your "writer" is an arbitrary Python object - subprocess piping needs real files - as those will be used by their O.S. handlers in the subprocess. The only reason you get any data written to the output file at all is because you proxied getattr - so the code in subprocess have retrieved the fileno() for your proxied file - the real, operating system level, file is the only thing seen in the actual subprocess (gzip) - not your writer object.

What can be done instead is to promote counter to a property that calls stat on your output file:

import subprocess
import uuid
import os

class CountingWriter:
    def __init__(self, filepath):
        self.filepath = filepath
        
    @property
    def counter(self):
        # report the current size of the output file on disk
        if not hasattr(self, "file"):
            return 0

        return os.stat(self.filepath).st_size

    def __enter__(self):
        # by bringing the actual file opening into the `__enter__`,
        # we avoid side effects just by instantiating the object.
        self.file = open(self.filepath, mode='wb')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()
        # keep "file" bound so that `counter` still reports the final
        # size after the with-block exits
        self.file = self.filepath

    def __getattr__(self, attr):
        # guard against infinite recursion when "file" is not set yet
        # (e.g. if `counter` is read before `__enter__` runs)
        if attr == "file":
            raise AttributeError(attr)
        return getattr(self.file, attr)

    def write(self, data):
        return self.file.write(data)
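
With that in place, the original loop can stay essentially unchanged (a sketch, assuming /bin/gzip is available and the imports from the question; note that gzip flushes its output in fairly large chunks, so the loop will overshoot the 10000-byte threshold rather than hit it exactly):

with CountingWriter('myoutput') as writer:
    with subprocess.Popen(['/bin/gzip', '--stdout'],
                          stdin=subprocess.PIPE,
                          stdout=writer) as gzipper:
        while writer.counter < 10000:
            gzipper.stdin.write(str(uuid.uuid4()).encode())
            gzipper.stdin.flush()
        gzipper.stdin.close()
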
jsbueno
  • Thanks! That's a cool idea. Unfortunately, when I replace my code with yours, I still get that `writer.counter` always returns 0. – kjo May 18 '23 at 14:37
  • Correction: if I wait long enough, `writer.counter` finally becomes non-zero, but the smallest non-zero value it ever shows is 65536. It sure looks like a buffering issue, but I have not found a way to prevent it. In particular, I found no value of `Popen`'s `bufsize` option that will do it. Also, adding `PYTHONUNBUFFERED=1` to the running environment had no noticeable effect. – kjo May 18 '23 at 14:51
  • Try using `os.stat(self.file)` instead of `os.stat(self.filepath)` - it will likely not change, but there is a chance. Yes, the thing is that gzip itself will just write 64K at a time to the file, and it is possible that even with a finer-grained "real" pipe implementation that won't change - you may ultimately have to compress in-process with Python's zlib and other libs - that way you can have full control of the stream, and force flushing of the file. – jsbueno May 18 '23 at 15:17
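
For completeness, the in-process approach suggested in that last comment could look roughly like this (a sketch: zlib with wbits=31 produces gzip-framed output, and since every byte passes through this process it can be counted exactly, with no subprocess or pipe buffering in the way):

import uuid
import zlib

# sketch of in-process compression with exact byte counting; Z_SYNC_FLUSH
# forces pending output out on every iteration (at some cost in compression
# ratio), so `written` is always up to date
compressor = zlib.compressobj(wbits=31)   # wbits=31 selects gzip framing

written = 0
with open('myoutput', 'wb') as out:
    while written < 10000:
        chunk = compressor.compress(str(uuid.uuid4()).encode())
        chunk += compressor.flush(zlib.Z_SYNC_FLUSH)
        out.write(chunk)
        written += len(chunk)
    out.write(compressor.flush())         # finish the gzip stream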