
What I wanted to happen:

My goal was to write a function that uses subprocess to run a command and read its stdout, whether immediate or delayed, line by line as it arrives, and to do that in a non-blocking, asynchronous way.

I also wanted to be able to pass a function to be called each time a new stdout line is read.

What happened instead:

Until the process being run has completely finished / been killed, the output isn't handled / printed as expected. All the correct output eventually appears, but I expected it to print in real time as the output is polled. Instead, it waits until the entire process finishes running, then prints all the expected output at once.


What I tried:

So I wrote a simple test script lab_temp.py to provide some output:

from time import sleep
for i in range(10):
    print('i:', i)
    sleep(1)

And a function set_interval.py, which I mostly copied from some SO answer (sorry, I don't recall which answer so I can give credit):

import threading

def set_interval(func, sec):
    def func_wrapper():
        t = set_interval(func, sec)  # schedule the next tick before running func
        result = func()
        if result is False:
            t.cancel()  # func asked to stop, so cancel the tick scheduled above
    t = threading.Timer(sec, func_wrapper)
    t.start()
    return t

And then a function call_command.py to run the command and asynchronously poll the process for output at some interval until it's done. I'm only barely experienced with asynchronous code, and that's probably related to my mistake, but I think the async part is being handled behind the scenes by threading.Timer (in set_interval.py).

call_command.py:

import subprocess
from set_interval import set_interval

def call_command(cmd, update_func=None):
    p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE, encoding='utf-8')
    def polling(): # Replaces "while True:" to convert to non-blocking
        for line in iter(p.stdout.readline, ''):
            if update_func:
                update_func(line.rstrip())
            
        if p.poll() == 0:
            print('False')
            return False # cancel interval
        else:
            print('True')
            return True # continue interval
    set_interval(polling, 1)

And each of these functions has a basic test:

set_interval.test.py (seems to run as expected):

from set_interval import set_interval

i = 0
def func():
    global i
    i += 1
    print(f"Interval: {i}...")
    if i > 5:
        return False
    else:
        return True
    
set_interval(func, 2)
print('non blocking')

call_command.test.py (results in the wrong behavior, as described initially):

from call_command import call_command

def func(out):
    print(out) # <- This will print in one big batch once
               # the entire process is complete.
    
call_command('python3 lab_temp.py', update_func=func)
print('non-blocking') # <- This will print right away, so I 
                      # know it's not blocked / frozen.

What have I gotten wrong here that's causing this deviation from my expectation?


Edit: Continued efforts...

import subprocess
from set_interval import set_interval


def call_command(cmd):
    p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE, encoding='utf-8')

    def polling():
        line = p.stdout.readline().strip()
        if not line and p.poll() is not None:
            return False
        else:
            print(line)
            return True

    set_interval(polling, 1)

Doesn't work. Nearly identical issues.

J.Todd
  • Would I be wrong to say that Python's implementation of asynchronous code is terrible? In JavaScript, all of these hours of fighting with this problem are solved by one simple built in function `setInterval(func, delay)`. But alas, Python is the language of data science and I'm stuck with it. – J.Todd May 11 '22 at 15:23
  • All of the approaches I see on SO answers related to this use platform-conditional logic, which is even more frustrating. Why do we need to worry about the platform to just read the output of a process in an asynchronous way... This seems so much more complicated than it should be... – J.Todd May 11 '22 at 15:30

1 Answer


The problem lies in the command you are running. The lab_temp.py script uses the print function, which writes to sys.stdout by default. sys.stdout is block-buffered when it isn't attached to a terminal, which is the case here since it's connected to a pipe. Since the buffer is large enough to hold the whole script's output, nothing is flushed until the script exits.
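As a quick illustration (a minimal sketch, relying on CPython's usual buffering rules): a script can check whether its stdout is attached to a terminal and whether line buffering is in effect.

import sys

# When launched with stdout connected to a pipe (as with subprocess.PIPE),
# isatty() returns False and stdout is block-buffered, not line-buffered.
print('stdout is a tty:', sys.stdout.isatty())
print('line buffering on:', sys.stdout.line_buffering)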

To fix this, you can either use sys.stdout's flush method:

from time import sleep
import sys
for i in range(10):
    print('i:', i)
    sys.stdout.flush()
    sleep(1)

or use print's flush parameter:

from time import sleep
for i in range(10):
    print('i:', i, flush=True)
    sleep(1)

or run Python interpreter with the -u option:

call_command(['python3', '-u', 'lab_temp.py'])

or set the PYTHONUNBUFFERED environment variable to a non-empty string:

import subprocess
import os
from set_interval import set_interval


def call_command(cmd):
    # Merge into the current environment; passing env={'PYTHONUNBUFFERED': '1'}
    # alone would replace the child's entire environment (PATH included).
    p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE, encoding='utf-8',
                         env={**os.environ, 'PYTHONUNBUFFERED': '1'})

    def polling():
        line = p.stdout.readline().strip()
        if not line and p.poll() is not None:
            return False
        else:
            print(line)
            return True

    set_interval(polling, 1)

BTW, in order to avoid threads, you might want to use asyncio:

import asyncio

async def call_command(cmd):
    p = await asyncio.create_subprocess_exec(cmd[0], *cmd[1:], stderr=asyncio.subprocess.PIPE, stdout=asyncio.subprocess.PIPE)
    async for line in p.stdout:
        line = line.strip().decode('utf-8')
        print(line)
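
Since call_command is now a coroutine, it needs an event loop to drive it; a minimal sketch of running it, reusing the lab_temp.py example together with the -u option suggested above:

asyncio.run(call_command(['python3', '-u', 'lab_temp.py']))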
radekholy24
  • We've been working on this all day and for the past few hours have been tinkering with asyncio solutions. Here's our problem (and I just tested it in your code too): We want the output to print in real-time as the process runs. Instead it waits until the end to run. Why, even with your asyncio solution, does it run the entire command before showing output when the command line has no issue printing the output in real-time? For example, we're testing it with `git clone`. – J.Todd May 12 '22 at 18:41
  • Also about the flush method, we found that, but it's a very backwards solution. It's changing the source of the logs instead of the code we actually control. We happen to control the source code of the logging process in this minimal example, but not, for example, when we experience the same problem doing `git clone` and numerous other commands. Clearly the command line is capable of outputting the stdout in real time without changing the source of the output, so why not our Python code? – J.Todd May 12 '22 at 18:47
  • The `asyncio` suggestion wasn't meant to solve the problem of the program waiting until the end. It was meant just as an alternative to `threading`. – radekholy24 May 12 '22 at 19:53
  • Both the `asyncio` and the `threading` solutions depend on when the subprocess flushes its output. That can be solved by the other suggestions I gave you. If you have no control over the source code of the logging process, you heavily depend on the author to provide you a way to configure the process to flush early (such as the `GIT_FLUSH` or the `PYTHONUNBUFFERED` environment variable). – radekholy24 May 12 '22 at 19:56
  • Actually, it turns out that I am wrong. There seem to be ways to unbuffer processes that you cannot modify: https://unix.stackexchange.com/a/25375 – radekholy24 May 12 '22 at 20:06
  • I assumed there *must* be, since command prompt clearly does so. – J.Todd May 12 '22 at 20:47
  • Unfortunately that solution is only viable on *nix but maybe it'll be a keyword to find the equivalent on Windows. Surely someone can explain how command prompt and power shell manage it. – J.Todd May 12 '22 at 20:48
  • Programs usually detect whether they are run from a shell or not and change their output behavior based on that. If they do not provide options to disable buffering, they sometimes provide options to change the output behavior (e.g. `/batch` or `/tty`). Maybe this can help: https://stackoverflow.com/q/11516258/542196 But still... All these "unbuffer" programs work in a way that they trick the subprocesses somehow. The right way is to ask the author of the subprocess to provide the right configuration options anyway. – radekholy24 May 13 '22 at 06:57
  • Thanks for the clarification, that does explain why the behavior would be different. I saw those programs, but I tried the Windows one and it did nothing for me. No error, no output, the process didn't run. I wonder why we can't just use Python to trick a subprocess into thinking it's running in a shell interactively without some special binary... Oh well. Getting real-time output in a reliable, universal way just seems impossible at this point, unfortunate. I never imagined it could be so complicated just to get stdout in real-time or that programs had to specify anything beyond printing. – J.Todd May 13 '22 at 07:03
  • To be more precise, you _can_ get the stdout in real-time but programs sometimes do not print to stdout in real-time. I think that the reason for buffering is that printing to stdout is generally slow. Therefore, it is a good practice to minimize the overhead needed to write to stdout if that does not affect the user experience. – radekholy24 May 13 '22 at 07:50
  • As for using Python to do the trick, I think that the reason is that your use case is not common enough to make it into the standard library. But there seem to be some (unmaintained) third-party libraries: https://stackoverflow.com/a/57528597/542196 – radekholy24 May 14 '22 at 04:37
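
For reference, here is a minimal sketch of the pseudo-terminal trick discussed in the comments above, assuming a POSIX system (the name call_command_pty is illustrative, not an existing API). The child process sees a terminal on its stdout, so it line-buffers its output and the parent can read it in real time:

import os
import pty
import subprocess

def call_command_pty(cmd):
    # Give the child a pseudo-terminal as stdout so it believes it is
    # running interactively and line-buffers its output (POSIX only).
    master, slave = pty.openpty()
    p = subprocess.Popen(cmd, stdout=slave, stderr=slave, close_fds=True)
    os.close(slave)  # the parent only needs the master end
    try:
        while True:
            try:
                data = os.read(master, 1024)
            except OSError:  # on Linux, EIO on the master signals EOF
                break
            if not data:
                break
            print(data.decode('utf-8', 'replace'), end='', flush=True)
    finally:
        os.close(master)
        p.wait()

With this, call_command_pty(['python3', 'lab_temp.py']) prints each line as it is produced, without modifying lab_temp.py.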