I frequently need to sort a collection of files that contain headers. Because sorting depends on the contents of the header, this use case is more complicated that similar questions (e.g., Is there a way to ignore header lines in a UNIX sort?).
I was hoping to use Python to read files, output the header of the first file, then pipe the tails into sort. I've tried this as a proof of concept:
#!/usr/bin/env python
import io
import subprocess
import sys
header_printed = False
sorter = subprocess.Popen(['sort'], stdin=subprocess.PIPE)
for f in sys.argv[1:]:
fd = io.open(f,'r')
line = fd.readline()
if not header_printed:
print(line)
header_printed = True
sorter.communicate(line)
When called as header-sort fileA fileB
, with fileA and fileB containing lines like
c float int
Y 0.557946 413
F 0.501935 852
F 0.768102 709
I get:
# sort file 1
Traceback (most recent call last):
File "./archive/bin/pipetest", line 17, in <module>
sorter.communicate(line)
File "/usr/lib/python2.7/subprocess.py", line 785, in communicate
self.stdin.write(input)
ValueError: I/O operation on closed file
The problem is communicate takes a string and the pipe is closed after writing. This means that the content must be read fully into memory. communicate doesn't take a generator (I tried).
An even simpler demonstration of this is:
>>> import subprocess
>>> p = subprocess.Popen(['tr', 'a-z', 'A-Z'], stdin=subprocess.PIPE)
>>> p.communicate('hello')
HELLO(None, None)
>>> p.communicate('world')
Traceback (most recent call last):
File "<ipython-input-14-d6873fd0f66a>", line 1, in <module>
p.communicate('world')
File "/usr/lib/python2.7/subprocess.py", line 785, in communicate
self.stdin.write(input)
ValueError: I/O operation on closed file
So, the question is, what's the right way (with Popen or otherwise) to stream data into a pipe in Python?