18

My problem is the following:

My Python script receives data via sys.stdin, but it needs to wait until new data is available on sys.stdin.

As described in the Python man page, I use the following code, but it completely overloads my CPU.

#!/usr/bin/python -u
import sys
while 1:
    for line in sys.stdin.readlines():
        pass  # do something useful

Is there a good way to avoid the high CPU usage?

Edit:

None of the suggested solutions works, so here is my exact problem.

You can configure the apache2 daemon to send every log line to a program instead of writing it to a logfile.

That looks like this:

CustomLog "|/usr/bin/python -u /usr/local/bin/client.py" combined

Apache2 expects my script to run permanently, wait for data on sys.stdin, and parse it when data arrives.

If I only use a for loop, the script exits once there is no more data in sys.stdin, and apache2 complains that my script exited unexpectedly.

If I use a while True loop, my script uses 100% CPU.

Abalus
    It sounds like your problem lies elsewhere then. In the python script it doesn't matter if there is data in stdin, just so long as it is open. Whatever is writing to the python script is closing the stream prematurely. – Dunes Aug 14 '11 at 11:39
  • First, you should be aware of the difference between `readline()` and `readlines()`. `readlines()` will read all the input from stdin until EOF (basically calling `read()` then splitting by newline). That means that it will return for the first time when stdin is closed. Future calls to `readlines()` (or `read()` and `readline()`) on stdin will return `[]` (or `""` for read/readline). Suggested reading: https://docs.python.org/2/tutorial/inputoutput.html https://unix.stackexchange.com/questions/103885/piping-data-to-a-processs-stdin-without-causing-eof-afterward – FluxLemur Jul 09 '18 at 16:20

9 Answers

22

The following should just work.

import sys
for line in sys.stdin:
    # whatever

Rationale:

The code will iterate over lines in stdin as they come in. If the stream is still open but there isn't a complete line yet, the loop will hang until either a newline character is encountered (and the whole line is returned) or the stream is closed (and whatever is left in the buffer is returned).

Once the stream has been closed, no more data can be written to or read from stdin. Period.

The reason your code was overloading your CPU is that once stdin has been closed, any subsequent attempt to iterate over stdin returns immediately without doing anything. In essence, your code was equivalent to the following.

for line in sys.stdin:
    # do something

while 1:
    pass # infinite loop, very CPU intensive
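The "returns immediately" behaviour can be sketched with io.StringIO standing in for an exhausted stdin (an illustration only; the variable names here are mine, not from the original post):

```python
import io

# io.StringIO stands in for a stdin whose writer has gone away.
stream = io.StringIO("a\nb\n")
lines = [line for line in stream]   # consumes everything up to EOF
again = [line for line in stream]   # stream exhausted: the loop body never runs
```

The second list comprehension finishes instantly with nothing in it, which is exactly why wrapping the loop in `while 1:` spins the CPU.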

Maybe it would be useful if you posted how you were writing data to stdin.

EDIT:

Python will (for the purposes of for loops, iterators, and readlines()) consider a stream closed when it encounters an EOF character. You can ask Python to read more data after this, but you cannot use any of the previous methods. The Python man page recommends using:

import sys
while True:
    line = sys.stdin.readline()
    # do something with line

When an EOF character is encountered, readline will return an empty string. The next call to readline will function as normal if the stream is still open. You can test this yourself by running the command in a terminal. Pressing Ctrl+D will cause the terminal to write the EOF character to stdin. This will cause the first program in this post to terminate, but the last program will continue to read data until the stream is actually closed. The last program should not peg your CPU at 100%, as readline will wait until there is data to return rather than returning an empty string.
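The empty-string-at-EOF rule can be illustrated with io.StringIO as a stand-in for stdin (a sketch; a real pipe additionally blocks while data is still pending):

```python
import io

stream = io.StringIO("first\nsecond\n")
first = stream.readline()    # a complete line, newline included
second = stream.readline()
at_eof = stream.readline()   # "" once the buffered data is exhausted
```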

I only have the problem of a busy loop when I try readline from an actual file. But when reading from stdin, readline happily blocks.

Dunes
  • Yeah, I also thought about using the normal log, opening it with open, and then working with seek and tell to get only the new lines. But it works in Perl, so I want to know how to do it in Python. – Abalus Aug 14 '11 at 12:03
  • 1
    There is no such thing as an "EOF character" when a program reads input. The OS intercepts ^D and closes the standard input to the program, but the program never sees the ^D. To see this, type `cat | wc` at the prompt and immediately type ^D: You'll send 0 characters to wc. – alexis Feb 25 '12 at 16:55
  • While you're right that there's no literal EOF character that enters the program's buffer, you're wrong to assert that the stream is closed. After Ctrl-D is entered, the next call to the underlying implementation of read (C's read) will return the EOF macro. Subsequent calls to read will then block until more data enters the buffer. So the stream never truly closes; it just informs the program that whatever is on the other end has signalled its intention to stop sending data. – Dunes Feb 26 '12 at 21:25
  • @Dunes: On my system (Ubuntu, screen). `sys.stdin.readline()` returns only empty strings (meaning EOF) once I've typed Ctrl+D even if I try to provide further input in the `while True` loop – jfs Apr 25 '14 at 01:42
  • @J.F.Sebastian Still works for me. I'm running Ubuntu 14 on virtualbox. Maybe you could try writing `/proc//fd/0`. What happens to your process? Do you get a busy loop, or does it just hang? – Dunes Jun 03 '14 at 15:55
  • I can't reproduce the behaviour now i.e., `sys.stdin.readline()` may return non-empty value after EOF (empty value) – jfs Jun 03 '14 at 19:06
  • @Dunes just use this `nc -l 12345 | python test.py` and connect with other process `telnet localhost 12345` when you exit from telnet you'll get empty strings without blocking. – estani Oct 29 '14 at 14:36
  • @estani Take a look at my new solution if you are interested. http://stackoverflow.com/a/26640086/529630 – Dunes Oct 29 '14 at 20:32
  • @Dunes looks good, but I'm still on python 2.7 and cannot test it. Could you test it with some streaming socket or something that can get close before sigterming the process? – estani Oct 29 '14 at 23:49
4

This actually works flawlessly (i.e. no runaway CPU) when you call the script from the shell, like so:

tail -f input-file | yourscript.py

Obviously, that is not ideal, since you then have to write all relevant output to that file first,

but it works without a lot of overhead, namely because of using readline(), I think:

while 1:
    line = sys.stdin.readline()

It will actually stop and wait at that line until it gets more input.

Hope this helps someone!

rm-vanda
3

I've come back to this problem after a long time. The issue appears to be that Apache treats a CustomLog like a file: something it can open, write to, close, and then reopen at a later date. This causes the receiving process to be told that its input stream has been closed. However, that doesn't mean the process's input stream cannot be written to again, just that whichever process was writing to the input stream will not be writing to it again.

The best way to deal with this is to set up a handler and let the OS know to invoke it whenever input is written to standard input. Normally you should avoid relying heavily on OS signal handling, as it is relatively expensive. However, copying a megabyte of text to the program below produced only two SIGIO events, so it's okay in this case.

fancyecho.py

import sys
import os
import signal
import fcntl
import threading

io_event = threading.Event()

# Event handlers should generally be as compact as possible.
# Here all we do is notify the main thread that input has been received.
def handle_io(signal, frame):
    io_event.set()

# invoke handle_io on a SIGIO event
signal.signal(signal.SIGIO, handle_io)
# send io events on stdin (fd 0) to our process 
assert fcntl.fcntl(0, fcntl.F_SETOWN, os.getpid()) == 0
# tell the os to produce SIGIO events when data is written to stdin
assert fcntl.fcntl(0, fcntl.F_SETFL, os.O_ASYNC) == 0

print("pid is:", os.getpid())
while True:
    data = sys.stdin.read()
    io_event.clear()
    print("got:", repr(data))
    io_event.wait()

Here is how you might use this toy program. The output has been cleaned up, since input and output were interleaved.

$ echo test | python3 fancyecho.py &
[1] 25487
pid is: 25487
got: 'test\n'
$ echo data > /proc/25487/fd/0
got: 'data\n'
$
Dunes
3

Use this:

#!/usr/bin/python
import sys
for line in sys.stdin.readlines():
    pass # do something useful
hamstergene
  • 24,039
  • 5
  • 57
  • 72
  • If I used your code, the script would end when there is no data left. But my script needs to wait until new data comes in. – Abalus Aug 14 '11 at 10:48
  • 2
    No. The `for` loop will hang waiting for more data. When stdin is closed, the loop will end and the script will continue execution. – hamstergene Aug 14 '11 at 10:49
  • No, it won't work in my case. My script needs to behave like this Perl code: `#!/usr/bin/perl $| = 1; while () { # ...put here any transformations or lookups... print $_; }` – Abalus Aug 14 '11 at 10:51
  • 1
    Try the solution and come back if there's a problem. – hamstergene Aug 14 '11 at 10:55
  • Actually, you're both wrong, in this case. Abalus: The script doesn't end when there's no data left in stdin, it does so when stdin closes (though it still wouldn't work for you). @hamstergene: sys.stdin.readlines() doesn't yield lines as they're found, but only when a ctrl-d/EOF is received. – Mr. B Apr 09 '15 at 20:40
1

I know I am bringing old stuff back to life, but this seems to be one of the top hits on the topic. The solution Abalus settled on sleeps for a fixed time each cycle, regardless of whether stdin is actually empty and the program should be idling, or there are a lot of lines waiting to be processed. A small modification makes the program process all pending messages rapidly and wait only when the queue is actually empty. That way only a line that arrives during the sleep period has to wait; the others are processed without any lag.

This example simply reverses the input lines. If you submit only one line, it responds within a second (or whatever your sleep period is set to), but it can also process something like `ls -l | reverse.py` really quickly. The CPU load of this approach is minimal, even on embedded systems like OpenWRT.

import sys
import time

while True:
    line = sys.stdin.readline().rstrip()
    if line:
        sys.stdout.write(line[::-1] + '\n')
    else:
        sys.stdout.flush()
        time.sleep(1)
Robert Špendl
1

I have been having a similar problem, where Python waits for the sender (whether a user or another program) to close the stream before the loop starts executing. I had solved it, but clearly in a non-Pythonic way, as I had to resort to while True: and sys.stdin.readline().

I eventually found a reference, in a comment on another post, to a module called io, which is an alternative to the standard file object. In Python 3 it is the default. From what I can make out, Python 2 treats stdin like a normal file and not a stream.

Try this, it worked for me:

import io
import sys

sys.stdin = io.open(sys.stdin.fileno())  # default is line buffering, good for user input

for line in sys.stdin:
    pass  # do stuff with line
Community
1

Well, I will now stick with these lines of code.

#!/usr/bin/python
import sys
import time
while 1:
    time.sleep(0.01)
    for line in sys.stdin:
        pass # do something useful

If I don't use time.sleep, the script puts too high a load on the CPU.

If I use:

for line in sys.stdin.readline():

it will only parse one line every 0.01 seconds, and the performance of apache2 is really bad. Thank you very much for your answers.
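As an aside, `readline()` returns a single string, so a `for` loop over it iterates over the characters of one line rather than over lines. A minimal illustration (with io.StringIO standing in for stdin):

```python
import io

# Iterating over the string returned by readline() yields characters, not lines.
chunks = list(io.StringIO("abc\n").readline())
```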

best regards Abalus

Abalus
0

I know this is an old thread, but I stumbled upon the same problem and figured out that it had more to do with how the script was invoked than with the script itself. At least in my case it turned out to be a problem with the 'system shell' on Debian (i.e. what /bin/sh is linked to; this is what Apache uses to execute the command that CustomLog pipes to). More info here: http://www.spinics.net/lists/dash/msg00675.html
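To see which shell `/bin/sh` resolves to on your own system (a quick check; the Debian default discussed in the linked thread is dash):

```python
import os

# On Debian-like systems /bin/sh is commonly a symlink to dash or bash.
print(os.path.realpath("/bin/sh"))
```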

hth, - steve

lonetwin
0

This works for me, code of /tmp/alog.py:

#! /usr/bin/python

import sys

fout = open("/tmp/alog.log", "a")

while True:
    dat = sys.stdin.readline()
    fout.write(dat)
    fout.flush()

in httpd.conf:

CustomLog "|/tmp/alog.py" combined

The key is: don't use

for dat in sys.stdin:

You will wait there and get nothing. And for testing, remember fout.flush(), otherwise you may not see any output. I tested on Fedora 15 with Python 2.7.1 and Apache 2.2: no CPU load, and alog.py stays resident in memory; if you run ps you can see it.
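The readline() loop above can be wrapped in a small testable function (a sketch; the function name and the in-memory streams are mine, not from the original answer):

```python
import io

def pump(infile, outfile):
    """Copy lines one at a time with readline(), flushing after each write,
    so every log line shows up immediately instead of sitting in a buffer."""
    while True:
        line = infile.readline()
        if not line:          # empty string: the writer closed the pipe
            return
        outfile.write(line)
        outfile.flush()

# In the CustomLog setup this would be pump(sys.stdin, open("/tmp/alog.log", "a")).
src = io.StringIO("GET /index.html\nGET /favicon.ico\n")
dst = io.StringIO()
pump(src, dst)
```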

PasteBT