11

I've implemented a non-blocking reader in Python, and I need to make it more efficient.

The background: I have massive amounts of output that I need to read from one subprocess (started with Popen()) and pass to another thread. Reading the output from that subprocess must not block for more than a few ms (preferably for as little time as is necessary to read available bytes).

Currently, I have a utility class which takes a file descriptor (stdout) and a timeout. I select() and readline(1) until one of three things happens:

  1. I read a newline
  2. my timeout (a few ms) expires
  3. select tells me there's nothing to read on that file descriptor.

Then I return the buffered text to the calling method, which does stuff with it.

Now, for the real question: because I'm reading so much output, I need to make this more efficient. I'd like to do that by asking the file descriptor how many bytes are pending and then readline([that many bytes]). It's supposed to just pass stuff through, so I don't actually care where the newlines are, or even if there are any. Can I ask the file descriptor how many bytes it has available for reading, and if so, how?

I've done some searching, but I'm having a really hard time figuring out what to search for, let alone if it's possible.

Even just a point in the right direction would be helpful.

Note: I'm developing on Linux, but that shouldn't matter for a "Pythonic" solution.

Matt
  • 775
  • 7
  • 24
  • 1
    Here is a utility that you should know about: [pipe viewer](http://www.ivarch.com/programs/pv.shtml) – wim Nov 19 '13 at 17:52

4 Answers

5

On Linux, os.pipe() is just a wrapper around pipe(2). Both return a pair of file descriptors. Normally one would use lseek(2) (os.lseek() in Python) to reposition the offset of a file descriptor as a way to get the amount of available data. However, not all file descriptors are capable of seeking.

On Linux, trying lseek(2) on a pipe returns an error (ESPIPE); see the manual page. That's because a pipe is more or less a buffer between a producer and a consumer of data. The size of that buffer is system-dependent.
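A minimal sketch of what that failure looks like from Python. The helper name `is_seekable` is made up for illustration; the only calls used are `os.pipe()` and `os.lseek()`.

```python
import errno
import os


def is_seekable(fd):
    """Return True if lseek() works on fd, False if fd is a pipe, FIFO or socket."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)  # ask for the current offset without moving it
        return True
    except OSError as exc:
        if exc.errno == errno.ESPIPE:  # "Illegal seek": fd cannot be repositioned
            return False
        raise


read_fd, write_fd = os.pipe()
pipe_seekable = is_seekable(read_fd)
os.close(read_fd)
os.close(write_fd)
print(pipe_seekable)
```

So a generic reader can probe a descriptor this way and fall back to another strategy when it is handed a pipe.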

On Linux, a pipe has a 64 KiB buffer by default, so that is normally the most data you can have available.

Edit: If you can change the way your subprocess works, you might consider using a memory mapped file, or a nice big piece of shared memory.

Edit2: Using polling objects is probably faster than select.
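A rough sketch of the polling-object approach, assuming a platform where `select.poll()` is available (it is not on Windows). The pipe here stands in for the subprocess's stdout, and the 5 ms timeout and 65536-byte read size are arbitrary choices, not recommendations.

```python
import os
import select

read_fd, write_fd = os.pipe()
os.write(write_fd, b"some output\n")

poller = select.poll()
poller.register(read_fd, select.POLLIN)

chunks = []
# poll() takes its timeout in milliseconds (select() takes seconds).
for fd, event in poller.poll(5):
    if event & select.POLLIN:
        # os.read() returns whatever is available, up to the given size.
        chunks.append(os.read(fd, 65536))

poller.unregister(read_fd)
os.close(read_fd)
os.close(write_fd)
data = b"".join(chunks)
```

Whether this is measurably faster than select() for a single descriptor is something you would have to benchmark in your own setup.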

Roland Smith
  • 42,427
  • 3
  • 64
  • 94
  • So even though my class might generically be able to use lseek() to count the number of available chars in a file descriptor, I won't be able to do that if I'm passing it a pipe of stdout (from Popen()), right? Am I understanding you correctly? – Matt Nov 19 '13 at 17:58
  • Are there any other ways that I might ask a pipe how many bytes it has waiting, or is lseek() the only way that kind of thing is done? – Matt Nov 19 '13 at 18:01
  • 1
    AFAIK, lseek is basically the only way. – Roland Smith Nov 19 '13 at 18:07
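For what it's worth, there is another mechanism that nobody in this thread brought up: the `FIONREAD` ioctl, which asks the kernel how many bytes are currently readable on a descriptor. The sketch below assumes a Unix-like system where `termios.FIONREAD` is defined and supported for pipes (Linux is one); the helper name `bytes_pending` is made up for illustration.

```python
import array
import fcntl
import os
import termios


def bytes_pending(fd):
    """Ask the kernel how many bytes can be read from fd right now."""
    buf = array.array("i", [0])
    # With a mutable buffer, ioctl() writes the result into buf in place.
    fcntl.ioctl(fd, termios.FIONREAD, buf, True)
    return buf[0]


read_fd, write_fd = os.pipe()
os.write(write_fd, b"I am some data")
pending = bytes_pending(read_fd)
data = os.read(read_fd, pending)  # will not block: that many bytes are waiting
os.close(read_fd)
os.close(write_fd)
```

Note that the count is only a snapshot; the writer may add more data between the ioctl and the read.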
0

This question seems to offer a possible solution, though it may require retooling.

Non-blocking read on a subprocess.PIPE in python

Otherwise, I assume you know about reading data N bytes at a time:

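# `pipe` is assumed to be the child's stdout (e.g. proc.stdout), opened in binary mode
chunks = []
while True:
    data = pipe.read(1024)   # Reads 1024 bytes or to end of pipe
    if not data:             # b'' means EOF
        break
    chunks.append(data)
    # Add your timeout break here
all_data = b''.join(chunks)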
supergra
  • 1,578
  • 13
  • 19
  • Won't `pipe.read(1024)` block until it gets 1024 bytes or throws an exception (like finding EOF)? – Matt Nov 19 '13 at 18:07
  • ...the method in that other question seems interesting, but doesn't solve my particular problem. I have a non-blocking reader that works; I need to know if I can do a non-blocking read THIS way ;-) – Matt Nov 19 '13 at 18:11
  • Yes, it will block until 1024 bytes are read. You can make 1024 as small as you like to make it highly unlikely (but not guaranteed) to exceed your timeout limit. But your readline(1) is also blocking, right? Albeit on a smaller scale. – supergra Nov 19 '13 at 22:42
  • 2
    Yes, that's exactly the problem I'm trying to solve. Because I have an unpredictable number of bytes to be read, there's no way to choose a perfect byte limit. I could choose a byte limit that **on average** neither blocks very many times, nor requires very many reiterations, but both of those conditions are undesirable. I'd rather read **exactly** the number of bytes that are waiting, and neither block, nor have to go back and get the rest. But it appears that's not possible... – Matt Nov 20 '13 at 20:34
  • 1
    @mHurley: fyi, `os.read(fd, 8096)` may return less than `8096` bytes i.e., a simple `select()` (to avoid blocking if there are *zero* bytes available in timeout seconds) + `os.read()` (to get the data available after the `select()`) might be enough. You could test whether `epoll()` produces better results in your case and try different `buffersize`s with `os.read()` (larger is not necessarily better). – jfs Jun 04 '16 at 20:30
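The select() + os.read() combination from the last comment could look roughly like this. It's a sketch, not a drop-in replacement for the asker's class: the function name `read_available`, the 5 ms default timeout and the 65536-byte buffer size are all assumptions made for illustration.

```python
import os
import select


def read_available(fd, timeout=0.005, bufsize=65536):
    """Return whatever is readable on fd within `timeout` seconds.

    Returns None if nothing arrived in time, and b"" at end of file.
    """
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        return None
    # os.read() returns as soon as any data is available, up to bufsize bytes;
    # it does not wait for the buffer to fill.
    return os.read(fd, bufsize)


read_fd, write_fd = os.pipe()
nothing_yet = read_available(read_fd)   # pipe is empty, so this times out
os.write(write_fd, b"short")
chunk = read_available(read_fd)         # only 5 bytes waiting, returned immediately
os.close(read_fd)
os.close(write_fd)
```

With a subprocess you would pass `proc.stdout.fileno()`. Mixing this with buffered reads on `proc.stdout` itself is best avoided, since data already pulled into Python's buffer is invisible to select().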
0

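You can find this out by calling os.fstat(file_descriptor) and checking the st_size attribute, which on some platforms (macOS, for example) reports the number of bytes waiting in the pipe. This is not portable: POSIX leaves st_size unspecified for pipes, and on Linux it is typically 0.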

import os
reader_file_descriptor, writer_file_descriptor = os.pipe()
os.write(writer_file_descriptor, b'I am some data')
readable_bytes = os.fstat(writer_file_descriptor).st_size
spacether
  • 2,136
  • 1
  • 21
  • 28
0

I've implemented this based on the idea from spacether's answer:

import select
import os

def readLen(p):
    # works on mac, might work on Linux, probably doesn't on windows (maybe return 1 in that case)
    size = os.fstat(p.fileno()).st_size
    return size

def readIfAny(p, timeout=1, default=None):
    if select.select([p], [], [], timeout)[0]:
        size = readLen(p)
        if size:
            return p.read(size)
    return default

....

import sys
data = readIfAny(sys.stdin)

Note: I've read in some places that you should avoid reading from and writing to a subprocess pipe directly like this, to avoid deadlocks, but this is the safest way I've found so far.

Note 2: I think sys.stdin.read will return b'' or '' on EOF. It doesn't seem to raise an exception, and I still don't really know how to tell when it finishes.

Note 3: Depending on the mode in which they're opened, you get bytes or a string. It also works with stdin, stdout, and stderr.
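On the EOF question in Note 2, here is a small sketch using `os.read()` on a raw pipe rather than `sys.stdin.read`: an empty bytes object is the end-of-file signal, and it only shows up once every write end of the pipe has been closed. The 4096-byte read size is arbitrary.

```python
import os

read_fd, write_fd = os.pipe()
os.write(write_fd, b"last bytes")
os.close(write_fd)  # the writer going away is what produces EOF on the reader

chunks = []
while True:
    chunk = os.read(read_fd, 4096)
    if not chunk:   # b"" means EOF: no data left and no writers remain
        break
    chunks.append(chunk)

os.close(read_fd)
received = b"".join(chunks)
```

With a subprocess, the equivalent moment is when the child exits or closes its stdout.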

Nande
  • 409
  • 1
  • 6
  • 11