The Context

I am using the subprocess module to start a process from Python. I want to be able to access the output (stdout, stderr) as soon as it is written/buffered.

  • The solution must support Windows 7. I require a solution for Unix systems too but I suspect the Windows case is more difficult to solve.
  • The solution should support Python 2.6. I am currently restricted to Python 2.6 but solutions using later versions of Python are still appreciated.
  • The solution should not use third party libraries. Ideally I would love a solution using the standard library but I am open to suggestions.
  • The solution must work for just about any process. Assume there is no control over the process being executed.

The Child Process

For example, imagine I want to run a Python file called counter.py via a subprocess. The contents of counter.py are as follows:

import sys

for index in range(10):

    # Write data to standard out.
    sys.stdout.write(str(index))

    # Push buffered data to disk.
    sys.stdout.flush()

The Parent Process

The parent process responsible for executing the counter.py example is as follows:

import subprocess

command = ['python', 'counter.py']

process = subprocess.Popen(
    command,
    bufsize=1,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    ) 

The Issue

Using the counter.py example I can access the data before the process has completed. This is great! This is exactly what I want. However, removing the sys.stdout.flush() call prevents the data from being accessed at the time I want it. This is bad! This is exactly what I don't want. My understanding is that the flush() call forces the data to be written to disk; before that, the data exists only in a buffer.

Remember that I want to be able to run just about any process. I do not expect the process to perform this kind of flushing, but I still expect the data to be available in real time (or close to it). Is there a way to achieve this?

A quick note about the parent process. You may notice I am using bufsize=1 for line buffering. I was hoping this would cause a flush for every line, but it doesn't seem to work that way. How does this argument work?

You will also notice I am using subprocess.PIPE. This is because it appears to be the only value which produces IO objects between the parent and child processes. I have come to this conclusion by looking at the Popen._get_handles method in the subprocess module (I'm referring to the Windows definition here). There are two important variables, c2pread and c2pwrite, which are set based on the stdout value passed to the Popen constructor. For instance, if stdout is not set, the c2pread variable is not set. This is also the case when using file descriptors and file-like objects.

I don't really know whether this is significant or not, but my gut instinct tells me I would want both read and write IO objects for what I am trying to achieve - this is why I chose subprocess.PIPE. I would be very grateful if someone could explain this in more detail. Likewise, if there is a compelling reason to use something other than subprocess.PIPE, I am all ears.

Method For Retrieving Data From The Child Process

import time
import subprocess
import threading
import Queue


class StreamReader(threading.Thread):
    """
    Threaded object used for reading process output stream (stdout, stderr).   
    """

    def __init__(self, stream, queue, *args, **kwargs):
        super(StreamReader, self).__init__(*args, **kwargs)
        self._stream = stream
        self._queue = queue

        # Event used to terminate thread. This way we will have a chance to 
        # tie up loose ends. 
        self._stop = threading.Event()

    def stop(self):
        """
        Stop thread. Call this function to terminate the thread. 
        """
        self._stop.set()

    def stopped(self):
        """
        Check whether the thread has been terminated.
        """
        return self._stop.isSet()

    def run(self):
        while True:
            # Flush buffered data (not sure this actually works?)
            self._stream.flush()

            # Read available data.
            for line in iter(self._stream.readline, b''):
                self._queue.put(line)

            # Breather.
            time.sleep(0.25)

            # Check whether thread has been terminated.
            if self.stopped():
                break


cmd = ['python', 'counter.py']

process = subprocess.Popen(
    cmd,
    bufsize=1,
    stdout=subprocess.PIPE,
    )

stdout_queue = Queue.Queue()
stdout_reader = StreamReader(process.stdout, stdout_queue)
stdout_reader.daemon = True
stdout_reader.start()

# Read standard out of the child process whilst it is active.  
while True:

    # Attempt to read available data.  
    try:
        line = stdout_queue.get(timeout=0.1)
        print '%s' % line

    # If data was not read within time out period. Continue. 
    except Queue.Empty:
        # No data currently available.
        pass

    # Check whether child process is still active.
    if process.poll() is not None:

        # Process is no longer active.
        break

# Process is no longer active. Nothing more to read. Stop reader thread.
stdout_reader.stop()

Here I am performing the logic which reads standard out from the child process in a thread. This allows for the scenario in which the read is blocking until data is available. Instead of waiting for some potentially long period of time, we check whether there is available data, to be read within a time out period, and continue looping if there is not.

I have also tried another approach using a kind of non-blocking read. This approach uses the ctypes module to access Windows system calls. Please note that I don't fully understand what I am doing here - I have simply tried to make sense of some example code I have seen in other posts. In any case, the following snippet doesn't solve the buffering issue. My understanding is that it's just another way to combat a potentially long read time.

import os
import subprocess

import ctypes
import ctypes.wintypes
import msvcrt

cmd = ['python', 'counter.py']

process = subprocess.Popen(
    cmd,
    bufsize=1,
    stdout=subprocess.PIPE,
    )


def read_output_non_blocking(stream):
    data = ''
    available_bytes = 0

    c_read = ctypes.c_ulong()
    c_available = ctypes.c_ulong()
    c_message = ctypes.c_ulong()

    fileno = stream.fileno()
    handle = msvcrt.get_osfhandle(fileno)

    # Read available data.
    buffer_ = None
    bytes_ = 0
    status = ctypes.windll.kernel32.PeekNamedPipe(
        handle,
        buffer_,
        bytes_,
        ctypes.byref(c_read),
        ctypes.byref(c_available),
        ctypes.byref(c_message),
        )

    if status:
        available_bytes = int(c_available.value)

    if available_bytes > 0:
        data = os.read(fileno, available_bytes)
        print data

    return data

while True:

    # Read standard out for child process.
    stdout = read_output_non_blocking(process.stdout)
    print stdout

    # Check whether child process is still active.
    if process.poll() is not None:

        # Process is no longer active.
        break

Comments are much appreciated.

Cheers

Yani
  • I'm not sure if I completely understand your problem, but question ["Python subprocess reading"](http://stackoverflow.com/q/5745471/2419207) may be worth looking at. – iljau Jan 23 '14 at 04:01
  • @iljau: Thanks. It's a similar issue and the EOF condition could play a part here but the responses to that question don't really provide a solution. I think its more of a question about how I can control the buffering. I need some way in which I can force the data to be flushed (or written to disk) more frequently. Or perhaps there is an entirely different solution. I was thinking sockets might work? I am still investigating. On the other hand - perhaps its wiser to just let the operating system do its thing. – Yani Jan 23 '14 at 04:38
  • Maybe [answer to "Non-blocking read on a subprocess.PIPE in python"](http://stackoverflow.com/a/4896288/2419207) may be of some help. – iljau Jan 23 '14 at 04:48
  • @iljau: Thanks again for your efforts. There are some useful responses in that question. However, `select` and `fcntl` not properly supported for Windows platform (`select` is supported but only using `socket` objects). `asyncproc`, `twisted` and `tornado` are all third party packages but I should look into these anyway, even if just for educational purposes. The `PYTHONUNBUFFERED` environment variable works but only if the executable (the child process) is a python script. Not bad! – Yani Jan 23 '14 at 06:10
  • Now this is a long shot, but article ["Asynchronous I/O in Windows for Unix Programmers"](http://tinyclouds.org/iocp-links.html) may give some useful pointers. – iljau Jan 23 '14 at 06:16
  • Also there is a python package named `pywin32` ([docs](http://timgolden.me.uk/pywin32-docs/contents.html) / [downloads](http://sourceforge.net/projects/pywin32/files/pywin32/Build%20218/)), which allows to access win32 api in a bit less painful way. – iljau Jan 23 '14 at 06:23
  • Note that `flush` only flushes the buffer. I/O is usually buffered one way or another, and Python normally uses line buffering for terminals, fixed-size buffers for everything else (including pipes). Only if the I/O happens to be a disk-based file does flushing mean that data is written to disk. – Martijn Pieters Jan 23 '14 at 11:32
  • You already discovered that you need to tell the **child** process not to buffer (using `PYTHONUNBUFFERED`, or explicitly flushing). *This is not something `subprocess` can solve*, because buffering is the responsibility of the child process itself. – Martijn Pieters Jan 23 '14 at 11:33

2 Answers


At issue here is buffering by the child process. Your subprocess code already works as well as it could, but if you have a child process that buffers its output then there is nothing that subprocess pipes can do about this.

I cannot stress this enough: the buffering delays you see are the responsibility of the child process, and how it handles buffering has nothing to do with the subprocess module.

You already discovered this; this is why adding sys.stdout.flush() in the child process makes the data show up sooner; the child process uses buffered I/O (a memory cache to collect written data) before sending it down the sys.stdout pipe¹.

Python automatically uses line-buffering when sys.stdout is connected to a terminal; the buffer flushes whenever a newline is written. When using pipes, sys.stdout is not connected to a terminal and a fixed-size buffer is used instead.

Now, the Python child process can be told to handle buffering differently; you can set an environment variable or use a command-line switch to alter how it uses buffering for sys.stdout (and sys.stderr and sys.stdin). From the Python command line documentation:

-u
Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode.

[...]

PYTHONUNBUFFERED
If this is set to a non-empty string it is equivalent to specifying the -u option.

If you are dealing with child processes that are not Python processes and you experience buffering issues with those, you'll need to look at the documentation of those processes to see if they can be switched to use unbuffered I/O, or be switched to more desirable buffering strategies.

One thing you could try is to use the script -c command to provide a pseudo-terminal to a child process. This is a POSIX tool, however, and is probably not available on Windows.


1. It should be noted that when flushing a pipe, no data is 'written to disk'; all data remains entirely in memory here. I/O buffers are just memory caches to get the best performance out of I/O by handling data in larger chunks. Only if you have a disk-based file object would fileobj.flush() cause it to push any buffers to the OS, which usually means that data is indeed written to disk.

Martijn Pieters
  • Thanks! Your description is nice and clear. Unfortunately I cannot assume the child process is a python process so the `PYTHONUNBUFFERED` solution will only work for specific cases. You mention that Python automatically uses line buffering when `sys.stdout` is connected to a terminal. I tried the `shell=True` option in hope that it would force line buffering by running the child process via a shell but this didn't work. Do you know how this works? Also, is there no way I can tell the system to handle buffering the way I want it to? – Yani Jan 24 '14 at 00:11
  • A shell is not a terminal. Python uses the [`isatty()` system call](http://linux.die.net/man/3/isatty) to determine if a stream is a TTY (terminal) or not. Again, the way Python behaves is application specific, other applications may choose to behave differently again. You could use the [`script` command](http://linux.die.net/man/1/script) to make the child process think it us connected to a TTY though. – Martijn Pieters Jan 24 '14 at 00:29
  • @MartijnPieters I love you! Python is one of the only processes that was buffering its output in my application, and the environment variable solution worked beautifully for me. I used process.StartInfo.EnvironmentVariables.Add("PYTHONUNBUFFERED", "TRUE"); to fix my issues, for anyone else who is having trouble. I wish I could upvote your response more than once, thanks! – Darkhydro Feb 01 '14 at 02:08
  • You totally made my day. I was pulling my hair out and the -u flag is exactly what I needed. – DrRobotNinja Nov 17 '14 at 22:35

Expect has a command called `unbuffer`:

http://expect.sourceforge.net/example/unbuffer.man.html

It will disable buffering for any command.

rbp