
When a series of commands is piped in Linux, the pipeline is handled efficiently: the earlier subprocesses are terminated once the last subprocess has exited. For instance:

cat filename | head -n 1
zcat filename | head -n 1
hadoop fs -cat /some/path | head -n 1

In each of the above, the cat command alone would take considerable time, but the combined command finishes quickly. How is this done internally? Are the first commands (the cat commands) sent SIGTERM or SIGKILL by the OS as soon as head terminates?

I want to do something similar in Python and was wondering what the best way to do it would be. I am trying the following:

from subprocess import Popen, PIPE

p1 = Popen(['hadoop', 'fs', '-cat', path], stdout=PIPE)
p2 = Popen(['head', '-n', str(num_lines)], stdin=p1.stdout, stdout=PIPE)
p2.communicate()
p1.kill()  # or p1.terminate()

Is this efficient?

Mukul Gupta
  • Why use `head`? You can just read lines from p1.stdout directly in python. See this question: http://stackoverflow.com/questions/1767513/read-first-n-lines-of-a-file-in-python – jbaiter Jun 05 '14 at 12:56
  • @jbaiter: Agreed but that still doesn't answer the question. I could have not used `head` and read from p1.stdout but what I want to know is whether it is safe to use p1.kill() or p1.terminate() as soon as I've read the required number of lines? Are there more elegant ways to achieve the same thing? – Mukul Gupta Jun 05 '14 at 13:43
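A minimal, self-contained version of the pipeline in question can be built with the pattern from the subprocess documentation: wire p1's stdout into p2, then close the parent's copy of p1.stdout so that p1 can receive SIGPIPE when p2 exits. Here `seq` stands in for the long-running hadoop command (an assumption made only so the example runs anywhere):

```python
from subprocess import Popen, PIPE

num_lines = 1
# `seq` is a stand-in for a long-running producer such as `hadoop fs -cat`
p1 = Popen(['seq', '1', '100000000'], stdout=PIPE)
p2 = Popen(['head', '-n', str(num_lines)], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()  # let p1 receive SIGPIPE if p2 exits first
out, _ = p2.communicate()
p1.wait()  # reap p1; it dies on its own once head closes the pipe
print(out.decode().strip())
```

With the parent's copy of the pipe closed, head's exit leaves p1 writing to a pipe with no readers, so no explicit kill() or terminate() is needed.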

1 Answer


Actually, I believe the process is sent SIGPIPE when head exits and closes the read end of the pipe. From Wikipedia:

SIGPIPE

The SIGPIPE signal is sent to a process when it attempts to write to a pipe without a process connected to the other end.

Also, from a few answers to a question on SIGPIPE:

...

You see, when the file descriptor with the pending write is closed, the SIGPIPE happens right then. While the write will return -1 eventually, the whole point of the signal is to notify you asynchronously that the write is no longer possible. This is part of what makes the whole elegant co-routine structure of pipes work in UNIX.

...

https://stackoverflow.com/a/8369516/2334407


...

https://www.gnu.org/software/libc/manual/html_mono/libc.html

This link says:

A pipe or FIFO has to be open at both ends simultaneously. If you read from a pipe or FIFO file that doesn't have any processes writing to it (perhaps because they have all closed the file, or exited), the read returns end-of-file. Writing to a pipe or FIFO that doesn't have a reading process is treated as an error condition; it generates a SIGPIPE signal, and fails with error code EPIPE if the signal is handled or blocked.

...

https://stackoverflow.com/a/18971899/2334407
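The EPIPE behavior quoted above can be observed directly from Python with a bare `os.pipe()` (a small sketch, not part of the original answer): if SIGPIPE is ignored, a write to a pipe whose read end has been closed fails with EPIPE instead of killing the process.

```python
import errno
import os
import signal

# Ignore SIGPIPE so the failed write raises EPIPE instead of
# terminating the process (CPython ignores it by default anyway).
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

r, w = os.pipe()
os.close(r)  # no reader left on the pipe
try:
    os.write(w, b'data')
except OSError as e:
    print(e.errno == errno.EPIPE)  # the write fails with EPIPE
os.close(w)
```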


I think it is to get the error handling correct without requiring a lot of code in everything writing to a pipe.

Some programs ignore the return value of write(); without SIGPIPE they would uselessly generate all output.

Programs that check the return value of write() likely print an error message if it fails; this is inappropriate for a broken pipe as it is not really an error for the whole pipeline.

https://stackoverflow.com/a/8370870/2334407


Now, to answer your question about the best way to do this: don't send any signals. Instead, read as much data as you need, and then simply close the pipe. The kernel will clean up for you, delivering SIGPIPE to the writing process the next time it tries to write.
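As the comments noted, you don't even need head for this: read the lines you want from p1.stdout directly, then close it. A sketch, again using `seq` as a stand-in for the hadoop command (assumed only to keep the example self-contained; on Python 3, subprocess restores SIGPIPE to its default disposition in the child, so the writer dies cleanly):

```python
import signal
from subprocess import Popen, PIPE

num_lines = 1
# `seq` is a stand-in for a long-running producer such as `hadoop fs -cat`
p1 = Popen(['seq', '1', '100000000'], stdout=PIPE)

# Read only the lines we need, straight from the pipe.
lines = [p1.stdout.readline() for _ in range(num_lines)]

p1.stdout.close()  # last reader gone: p1 gets SIGPIPE on its next write
p1.wait()          # reap the child; it was killed by SIGPIPE
print(lines[0].decode().strip())
```

After wait(), p1.returncode is typically `-signal.SIGPIPE`, confirming that the kernel, not the Python code, terminated the writer.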

haneefmubarak