1

I frequently need to sort a collection of files that contain headers. Because sorting depends on the contents of the header, this use case is more complicated than similar questions (e.g., Is there a way to ignore header lines in a UNIX sort?).

I was hoping to use Python to read files, output the header of the first file, then pipe the tails into sort. I've tried this as a proof of concept:

#!/usr/bin/env python

import io
import subprocess
import sys

header_printed = False

sorter = subprocess.Popen(['sort'], stdin=subprocess.PIPE)

for f in sys.argv[1:]:
    fd = io.open(f,'r')
    line = fd.readline()
    if not header_printed:
        print(line)
        header_printed = True
    sorter.communicate(line)

When called as header-sort fileA fileB, with fileA and fileB containing lines like

c   float   int
Y   0.557946     413
F   0.501935     852
F   0.768102     709

I get:

# sort file 1
Traceback (most recent call last):
  File "./archive/bin/pipetest", line 17, in <module>
    sorter.communicate(line)
  File "/usr/lib/python2.7/subprocess.py", line 785, in communicate
    self.stdin.write(input)
ValueError: I/O operation on closed file

The problem is that communicate takes a single string and closes the pipe after writing it, so the whole input must be read into memory first. communicate doesn't accept a generator (I tried).

An even simpler demonstration of this is:

>>> import subprocess
>>> p = subprocess.Popen(['tr', 'a-z', 'A-Z'], stdin=subprocess.PIPE)
>>> p.communicate('hello')
HELLO(None, None)
>>> p.communicate('world')
Traceback (most recent call last):
  File "<ipython-input-14-d6873fd0f66a>", line 1, in <module>
    p.communicate('world')
  File "/usr/lib/python2.7/subprocess.py", line 785, in communicate
    self.stdin.write(input)
ValueError: I/O operation on closed file

So, the question is, what's the right way (with Popen or otherwise) to stream data into a pipe in Python?

Reece
  • related: [Sorting text file by using Python](http://stackoverflow.com/q/14465154/4279) – jfs Sep 19 '15 at 21:24

3 Answers

2

For your specific case, where subprocess.PIPE is passed for only one standard handle (stdin), you can safely call sorter.stdin.write(line) over and over. When you have finished writing, call sorter.stdin.close() so that sort knows the input is complete and can do the actual sorting and output. Calling sorter.communicate() with no argument would probably work too; otherwise, after closing stdin, you'd probably want to call sorter.wait() to let it finish.
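A minimal Python 3 sketch of this single-pipe approach, applied to the header-sort case from the question. The function name header_sort and its out parameter are illustrative assumptions, not part of the original script. It also assumes Python 3.7+ (for text=True), a sort executable on PATH, and that every input file starts with exactly one header line.

```python
import subprocess
import sys


def header_sort(paths, out=None):
    """Write the first file's header to `out`, then the sorted body lines of all files.

    `header_sort` and `out` are hypothetical names used for illustration.
    `out` must be backed by a real file descriptor (e.g. sys.stdout or an open file).
    """
    if out is None:
        out = sys.stdout
    sorter = None
    for path in paths:
        with open(path) as source:
            header = source.readline()
            if sorter is None:
                out.write(header)
                out.flush()  # the header must be written before sort uses the same descriptor
                # Only stdin is a PIPE; sort writes straight to `out`.
                sorter = subprocess.Popen(
                    ["sort"], stdin=subprocess.PIPE, stdout=out, text=True
                )
            for line in source:
                sorter.stdin.write(line)  # repeated writes are fine with a single pipe
    if sorter is None:
        return 0  # no input files, nothing to sort
    sorter.stdin.close()  # EOF tells sort that the input is complete
    return sorter.wait()  # let sort finish and report its exit status
```

Called as header_sort(['fileA', 'fileB']), this should print fileA's header followed by the sorted body lines of both files. Because sort writes to the parent's own output rather than to a second pipe, nothing has to be read back and there is no pipe that could fill up.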

If you need to deal with more than one piped standard handle, the right way is either threading, with a dedicated thread for each pipe that must be handled beyond the first (relatively simple in concept, but heavyweight, and it introduces all the headaches of threading), or using the select module (in Python 3.4+, the selectors module), which is quite tricky to get right but can, under some circumstances, be more efficient. A third option is to create temporary files for output: you write directly to the process's stdin while the process writes to a file (and therefore won't block), and you then read the file at your leisure. Note that the subprocess won't necessarily have flushed its own output buffers until it exits, so output may not arrive promptly in response to your input until further inputs and outputs have filled and flushed the buffer.

subprocess.Popen's .communicate() method uses either threads or select module primitives itself (depending on OS support; the implementation is under the various _communicate methods here) whenever you pass subprocess.PIPE for more than one of the standard handles; it's how you have to do it.
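The thread-per-pipe option described above can be sketched roughly as follows for the case where both stdin and stdout are pipes. The name pipe_through is a hypothetical helper, not a subprocess API. The sketch assumes Python 3.7+ and that the caller consumes all of the yielded output; error handling (for example a child that exits early) is left out.

```python
import subprocess
import threading


def pipe_through(cmd, lines):
    """Stream `lines` into `cmd` and yield the command's output lines.

    `pipe_through` is a hypothetical helper name used for illustration.
    """
    proc = subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )

    def feed():
        # Runs in its own thread so that a full stdin pipe blocks only
        # this writer, never the loop that drains stdout.
        with proc.stdin:  # closing stdin signals EOF to the child
            for line in lines:
                proc.stdin.write(line)

    writer = threading.Thread(target=feed)
    writer.start()
    with proc.stdout:
        for line in proc.stdout:  # the calling thread consumes the output pipe
            yield line
    writer.join()
    proc.wait()
```

For example, list(pipe_through(['sort'], some_iterable_of_lines)) would feed the lines in without holding them all in memory on the input side, while the dedicated writer thread keeps the two pipes from blocking each other.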

ShadowRanger
  • there is a single pipe (subprocess' stdin). Why do you need multiple threads here? – jfs Sep 19 '15 at 21:18
  • Yeah, my mistake. I assumed this was one of those "I'm trying to do what `communicate` does without using `communicate`" cases, and over answered. I've edited to explain how it works for the specific case with only a single `PIPE`-ed standard handle. – ShadowRanger Sep 21 '15 at 19:25
1

Just write to the pipe directly:

#!/usr/bin/env python2
import fileinput
import subprocess

process = subprocess.Popen(['sort'], stdin=subprocess.PIPE)
with process.stdin as pipe, fileinput.FileInput() as file:
    for line in file:
        if file.isfirstline(): # print header
            print line,
        else: # pipe tails
            pipe.write(line)
process.wait()
jfs
  • I started down a similar path (yours is much more elegant), but then I saw this admonishment in the docs: "Warning Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process." I understand the deadlock potential when a script is writing to a subprocess' stdin and reading from its stdout, but I don't understand the deadlock potential when the script and the subprocess are both streaming to stdout (as in your answer). Comments? – Reece Sep 20 '15 at 00:23
  • @Reece: it doesn't apply here. The rule to avoid the deadlock in the general case is simple: never use PIPE unless you consume the corresponding pipe. – jfs Sep 20 '15 at 00:30
  • That was my interpretation too. I appreciate the follow-up. – Reece Sep 20 '15 at 00:31
  • I used this answer as the basis for a tool that provides multi-file sort for files with headers. Headers are defined by number of lines, prefix, or regexp. Output headers are deduped. Custom sort options are permissible. Data is also accepted from stdin. It's here: https://bitbucket.org/reece/reece-base/src/a711b1ecc8a31c24c16ad8b759525c95becb0dd9/bin/header-sort?at=default&fileviewer=file-view-default – Reece Sep 21 '15 at 21:54
  • @Reece: the way you use it, you could just write to `sys.stdout` in your Python script that strips the headers i.e., instead of `./your-script .. -- sort_options` you would write: `./your-script .. | sort sort_options`. – jfs Sep 21 '15 at 22:47
  • @j-f-sebastian: I don't see how that suggestion works. The (uniquified) headers need to be output first, then the post-header content needs to be sorted and appended. `script .. | sort` would cause the headers to be sorted also, which I don't want. Or did I misunderstand your suggestion? – Reece Sep 21 '15 at 23:57
  • @Reece: drop `Popen()`, replace `pipe.write(line)` with `sys.stdout.write(line)`, replace `--` on the command line with `| sort` that is all. – jfs Sep 22 '15 at 00:01
  • But `sys.stdout.write(line)` and `print line,` will both end up on stdout. They'll be reordered momentarily, but then *both* end up getting sorted. Not only will the header lines be reordered, but they'll be commingled with the content. I think what you're proposing is tantamount to `(echo "header"; echo "a is for aardvark"; echo "z is for zebra") | sort` (or correct me if I misunderstand). – Reece Sep 22 '15 at 00:13
  • @Reece: your output already contains both headers (from `python`) and sorted lines (from `sort`). You could write headers to stderr or some other file. – jfs Sep 22 '15 at 00:36
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/90272/discussion-between-reece-and-j-f-sebastian). – Reece Sep 22 '15 at 00:49
  • Upshot of chat for closure: We agreed that it's typically preferable to externalize a pipe when possible, but that the pipe above would indeed sort headers into the output. -30- – Reece Sep 22 '15 at 21:45
0

You can write to stdin and read from stdout directly; however, depending on your subprocess, you need a "flushing mechanism" to make the subprocess process your input. The code below works for the first write, but because it closes stdin, it also ends the subprocess. If you replace close() with flush(), or if you can add some trailing characters to push the data through your subprocess, then you can keep using it. Otherwise, I would recommend taking a look at multithreading in Python, especially pipes.

import subprocess

p = subprocess.Popen(['tr', 'a-z', 'A-Z'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p.stdin.write("hello\n")
p.stdin.close()  # EOF: tr flushes its output and exits
p.stdout.readline()
'HELLO\n'
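When the whole input fits in memory, the same two-pipe exchange can be left to communicate(), which writes the input and drains the output together. This is only a sketch, assuming Python 3.7+ and a tr executable on PATH; upper_via_tr is a hypothetical name.

```python
import subprocess


def upper_via_tr(text):
    """Send `text` through `tr a-z A-Z` and return everything it prints.

    `upper_via_tr` is a hypothetical helper name used for illustration.
    """
    proc = subprocess.Popen(
        ["tr", "a-z", "A-Z"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    # communicate() writes the input, closes stdin, and reads stdout
    # concurrently, so neither side can stall on a full pipe.
    output, _ = proc.communicate(text)
    return output
```

The trade-off is the one the question started from: communicate() can be called only once per process and needs the complete input up front, so it does not stream.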
ilke444
  • It's really not safe with either `flush` or `close`; if you send enough data to the subprocess that its own output pipe fills, it will block. If you fill its input pipe, you block. And because it's waiting for you to read, and you're waiting for it to read, you deadlock, and never reach the `readline`. Also, if you `flush` instead of `close`, the subprocess may be block buffering its own output, so the `readline` could block forever (and you'd never return from the `readline` to send more data than might cause it to flush its buffer). – ShadowRanger Sep 19 '15 at 02:08