
I'm running memcached with the following bash command pattern:

memcached -vv 2>&1 | tee memkeywatch2010098.log 2>&1 | ~/bin/memtracer.py | tee memkeywatchCounts20100908.log

to try to track down keys whose gets are unmatched by sets, platform-wide.

The memtracer script is below and works as desired, with one minor issue. Watching the intermediate log file size, memtracer.py doesn't start receiving input until memkeywatchYMD.log is about 15-18K in size. Is there a better way to read stdin, or perhaps a way to cut the buffer size down to under 1K, for faster response times?

#!/usr/bin/python

import sys
from collections import defaultdict

if __name__ == "__main__":
    # Net balance per key: "set" decrements, server "sending" replies
    # (get hits) increment.
    keys = defaultdict(int)
    GET = 1
    SET = 2
    CLIENT = 1
    SERVER = 2

    for line in sys.stdin:
        key = None
        components = line.strip().split(" ")
        # memcached -vv prefixes client commands with "<" and server
        # responses with ">"
        direction = CLIENT if components[0].startswith("<") else SERVER

        if direction == CLIENT:
            command = SET if components[1] == "set" else GET
            key = components[2]
            if command == SET:
                keys[key] -= 1
        elif direction == SERVER:
            command = components[1]
            if command == "sending":
                key = components[3]
                keys[key] += 1

        if key is not None:
            print "%s:%s" % (key, keys[key])
David

6 Answers


You can completely remove buffering from stdin/stdout by using Python's -u flag:

-u     : unbuffered binary stdout and stderr (also PYTHONUNBUFFERED=x)
         see man page for details on internal buffering relating to '-u'

and the man page clarifies:

   -u     Force stdin, stdout and stderr to be totally unbuffered. On
          systems where it matters, also put stdin, stdout and stderr
          in binary mode. Note that there is internal buffering in
          xreadlines(), readlines() and file-object iterators ("for
          line in sys.stdin") which is not influenced by this option.
          To work around this, you will want to use
          "sys.stdin.readline()" inside a "while 1:" loop.

Beyond this, altering the buffering for an existing file is not supported, but you can make a new file object with the same underlying file descriptor as an existing one, and possibly different buffering, using os.fdopen. I.e.,

import os
import sys
newin = os.fdopen(sys.stdin.fileno(), 'r', 100)

should bind newin to the name of a file object that reads the same FD as standard input, but buffered by only about 100 bytes at a time (and you could continue with sys.stdin = newin to use the new file object as standard input from there onwards). I say "should" because this area used to have a number of bugs and issues on some platforms (it's pretty hard functionality to provide cross-platform with full generality) -- I'm not sure what its state is now, but I'd definitely recommend thorough testing on all platforms of interest to ensure that everything goes smoothly. (-u, removing buffering entirely, should work with fewer problems across all platforms, if that might meet your requirements).
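
For instance, a minimal sketch of that rebinding pattern (the 100-byte buffer is only an illustration, and readline() is used to sidestep the iterator buffering quoted above):

import os
import sys

# Rebind sys.stdin to a new file object over the same fd with a
# ~100-byte buffer, then read with readline() rather than
# "for line in sys.stdin" to avoid the iterator's internal buffering.
newin = os.fdopen(sys.stdin.fileno(), 'r', 100)
sys.stdin = newin

while 1:
    line = sys.stdin.readline()
    if not line:
        break
    sys.stdout.write(line)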

Alex Martelli
  • thanks, the -u flag for a linux environment was the winner. I had previously tried using os.fdopen and ran into the same buffering issue, even if I set the buffer size to 10. – David Sep 08 '10 at 18:48
  • Unfortunately, Python 3 stubbornly still opens `stdin` in buffered text mode. Only `stdout` and `stderr` are affected by the `-u` switch now. – Martijn Pieters Jan 11 '13 at 17:55
  • Any work-arounds for Python3? Perhaps an event-driven library/option? – Brad Hein Jan 24 '14 at 16:37
  • I tried with gio_channels, and got it working - but the behaviour is exactly the same: no output till `enter` is pressed – jcoppens Jun 06 '15 at 15:10
  • This worked for me in Python 3.4.3: `os.fdopen(sys.stdin.fileno(), 'rb', buffering=0)` – Denilson Sá Maia Dec 03 '15 at 15:32
  • Nice idea, @DenilsonSá! – Alex Martelli Dec 03 '15 at 22:14
  • @DenilsonSáMaia: No need to reopen it yourself. `sys.stdin` is really three layers: an `io.TextIOWrapper` (to decode `bytes` to `str`) wrapping an `io.BufferedReader` (to buffer `bytes`) wrapping an `io.FileIO` (the actual thing that submits the system calls). And they're all available as attributes; `sys.stdin.buffer` gets the `BufferedReader` without text decoding, `sys.stdin.buffer.raw` gets the `FileIO` without buffering. – ShadowRanger Nov 20 '20 at 19:50
  • On Python 2.6-2.7, I'd recommend using `io.open(sys.stdin.fileno(), 'rb', buffering=0, closefd=True)` over `os.fdopen`; `os.fdopen` still hands you back a `file` object implemented on top of C `stdio`. `io.open` is the Python 3 style; in binary mode with buffering disabled it bypasses the C `stdio` wrapping in favor of OS-native I/O, and avoids the problem with `for line in sys.stdin:` (which unbuffering doesn't fix, due to internal buffering in `file.__next__`, and which otherwise requires [weird hacks to work around](https://stackoverflow.com/a/28919832/364696)). – ShadowRanger Nov 20 '20 at 19:55
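
A minimal sketch of the io.open rewrap suggested in the last comment (Python 2.6-2.7; note closefd=False is used here, unlike the comment, so closing the wrapper leaves fd 0 open):

import io
import sys

# Rewrap stdin's fd via the io module: binary mode with buffering
# disabled bypasses both the C stdio wrapping and the internal block
# buffering of file.__next__.
unbuffered = io.open(sys.stdin.fileno(), 'rb', buffering=0, closefd=False)

# readline() returns as soon as a newline arrives on the pipe.
for line in iter(unbuffered.readline, b''):
    sys.stdout.write(line)
    sys.stdout.flush()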

You can simply use sys.stdin.readline() instead of sys.stdin.__iter__():

import sys

while True:
    line = sys.stdin.readline()
    if not line: break # EOF

    sys.stdout.write('> ' + line.upper())

This gives me line-buffered reads using Python 2.7.4 and Python 3.3.1 on Ubuntu 13.04.
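
Applied to the question's memtracer.py, only the loop header changes; a sketch (the counting body stays exactly as in the original script):

import sys

while True:
    line = sys.stdin.readline()
    if not line:  # EOF
        break
    components = line.strip().split(" ")
    # ... same GET/SET bookkeeping as in the original script ...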

Søren Løvborg
  • This isn't really relevant to the question; did you mean to post this as a comment? – David Aug 15 '13 at 18:53
  • As I understood, the question was "Is there a better way to read in stdin" [to avoid input buffer issues when using a Python script in a pipeline], and my answer (three years late as it may be) is "Yes, use `readline` instead of `__iter__`". But maybe my answer is platform dependent, and you still have buffer issues if you try the above code? – Søren Løvborg Aug 16 '13 at 10:42
  • Ah okay, I understand. I meant MUCH smaller buffer sizes (like 80 bytes or less) for stdin buffering. For 2.7 you can't affect those buffer sizes without the `-u` flag Alex mentions in his answer. – David Aug 19 '13 at 21:51
  • Interesting that Alex didn't catch this: https://github.com/certik/python-2.7/blob/c360290c3c9e55fbd79d6ceacdfc7cd4f393c1eb/Objects/fileobject.c#L1377 You're correct that readline is likely faster, as it uses getc incrementally while file_iternext buffers 8192 bytes as defined in the source. – David Aug 19 '13 at 22:33
  • This is pretty important -- I see programs not being interactive enough due to buffering stdin (instead of reacting immediately). I did not know this before. – dan3 Oct 25 '13 at 08:51

Since sys.stdin.__iter__ still buffers ahead, you can get an iterator that behaves mostly identically (yielding lines and stopping at EOF, when readline() returns an empty string) but without the buffering delay, by using the 2-argument form of iter to make an iterator of sys.stdin.readline:

import sys

for line in iter(sys.stdin.readline, ''):
    sys.stdout.write('> ' + line.upper())

Or provide None as the sentinel (but note that then you need to handle the EOF condition yourself).
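
A short sketch of the None-sentinel variant: since readline() never actually returns None, the iterator never stops on its own, and the loop has to detect the empty string that readline() returns at EOF:

import sys

for line in iter(sys.stdin.readline, None):
    if not line:  # readline() returns '' at EOF, which never equals None
        break
    sys.stdout.write('> ' + line.upper())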

  • This seems like it would have been better as a comment on Søren's answer. Alex Martelli and Søren have provided answers, while this is more of an improvement on Søren's input. – David Mar 10 '15 at 12:03
  • What you propose here is the best solution I've seen to this horrible problem; I'm about to sweep through all my Python code and replace "for line in sys.stdin" with it. I see it's actually listed in the ref page you referred to. What's still not clear to me is... why on earth does "for line in sys.stdin" behave differently from "for line in iter(sys.stdin.readline, ''):"? As far as I can see they are semantically identical, except that the former version's behavior looks to me like a nasty bug, behavior no one could ever want. If anyone has a counterexample I'd love to see it. – Don Hatch Jan 29 '16 at 19:04
  • @DonHatch when iterating on stdin I agree that the behaviour is weird and bug-like, but when the file is not stdin reading 8k at once will improve performance. – Sam Jacobson Aug 31 '16 at 08:09
  • @SamJacobson Why would it matter whether the input stream in question is stdin or not? (Perhaps you are meaning to point out some difference among terminals, files, and pipes? But such differences are independent of whether it's stdin.) And when you say reading 8k at once will improve performance-- improve performance compared to what?? I don't think I've proposed or advocated any behavior that would ever read less than 8k at once when 8k is available on the input. – Don Hatch Aug 31 '16 at 12:21
  • @SamJacobson BTW I filed https://bugs.python.org/issue26290 on this a while ago: "fileinput and 'for line in sys.stdin' do strange mockery of input buffering". – Don Hatch Aug 31 '16 at 12:23
  • This is definitely the right solution on Python 2; `for line in sys.stdin:` on Python 2 uses internal usermode buffering that blocks until it fills a block before producing any lines, and there's no way to disable it aside from ensuring you don't use `file.__next__` (`-u` doesn't help); the only solutions are to use `file.readline` as demonstrated here, or rewrap using the `io` module to get Python 3 behaviors (which don't block until a block is filled even when buffering is enabled; it's essentially a single system call and it's fine with a short read if the short read includes a newline). – ShadowRanger Nov 20 '20 at 19:58

This worked for me in Python 3.4.3:

import os
import sys

unbuffered_stdin = os.fdopen(sys.stdin.fileno(), 'rb', buffering=0)

The documentation for fdopen() says it is just an alias for open().

open() has an optional buffering parameter:

buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer.

In other words:

  • Fully unbuffered stdin requires binary mode and passing zero as the buffer size.
  • Line-buffering requires text mode.
  • Any other buffer size seems to work in both binary and text modes (according to the documentation).
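
A short sketch of reading lines through such a fully unbuffered stdin (Python 3; lines come back as bytes, and readline() on a raw stream fetches a byte at a time, favoring responsiveness over throughput):

import os
import sys

unbuffered_stdin = os.fdopen(sys.stdin.fileno(), 'rb', buffering=0)

# Each line is available the moment its newline arrives on the pipe.
for raw_line in iter(unbuffered_stdin.readline, b''):
    print(raw_line.decode('utf-8', 'replace').rstrip('\n'))
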
Denilson Sá Maia

It may be that your trouble is not with Python but with the block buffering that C stdio applies when a program's output goes to a pipe rather than a terminal. When this is the problem, output is flushed not line by line but in 4K blocks.

To stop this buffering, prefix the commands in the pipeline with the unbuffer command from the expect package, such as:

unbuffer memcached -vv 2>&1 | unbuffer -p tee memkeywatch2010098.log 2>&1 | unbuffer -p ~/bin/memtracer.py | tee memkeywatchCounts20100908.log

The unbuffer command needs the -p option when used in the middle of a pipeline.
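
If editing the Python stage is an option, flushing its own stdout after every line covers that link of the chain (a sketch only; it does not unbuffer the upstream commands, which still need unbuffer):

import sys

# Pass lines through, flushing after each one so the downstream tee
# sees output immediately instead of after a 4K block fills.
for line in iter(sys.stdin.readline, ''):
    sys.stdout.write(line)
    sys.stdout.flush()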

EvertW

The only way I could do it with Python 2.7 was:

tty.setcbreak(sys.stdin.fileno())

from Python nonblocking console input. This completely disables the buffering and also suppresses the echo.
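
A minimal sketch of that approach (POSIX terminals only; the termios save/restore is an addition not shown in the original answer, since setcbreak() alters the terminal settings for the whole tty):

import sys
import termios
import tty

fd = sys.stdin.fileno()
old_settings = termios.tcgetattr(fd)
try:
    tty.setcbreak(fd)  # disables line buffering and echo
    while True:
        ch = sys.stdin.read(1)  # returns after every single keypress
        if ch == 'q':
            break
        print 'got: %r' % ch
finally:
    termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)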

EDIT: Regarding Alex's answer, the first proposition (invoking Python with -u) is not possible in my case (see the shebang limitation).

The second proposition (duplicating the fd with a smaller buffer: os.fdopen(sys.stdin.fileno(), 'r', 100)) does not work when I use a buffer of 0 or 1, as mine is interactive input and I need every character pressed to be processed immediately.

calandoa
  • Weird, Alex's answer worked for me back then. Wonder if a backport update changed/broke something – David Feb 04 '17 at 21:41
  • `tty.setcbreak` is not about Python buffering but about the kernel tty layer buffering input. Thus it does not apply to pipes. – textshell Feb 10 '19 at 12:24