I've been trying to sort some very large CSVs with the command-line sort utility so that they are ready for processing in Python. I'd like to run the sort from Python itself using subprocess, but I can't get it to work. Here's the code:
import shlex
import subprocess
fn = 'path/to/filename'
p1 = subprocess.Popen(shlex.split('tail -n +2 {}'.format(fn)), stdout=subprocess.PIPE)
p2 = subprocess.Popen(shlex.split("sort -t$'\t' -k2,2n -k3,3"), stdin=p1.stdout, stdout=subprocess.PIPE)
output = p2.communicate()[0]
print(output)
When I print
p1.communicate()[0]
I get the bytestream of the file, as expected, but when I print
p2.communicate()[0]
I get an empty bytestream, and I can't figure out why.
As a side note, if there's a better way of sorting a CSV too large to fit in memory, then I'd love to hear about it.
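A possible cause, as far as I can tell: `$'\t'` is bash quoting syntax, and shlex.split does not interpret it. Python has already turned `\t` into a real tab, so sort would receive the two-character separator `$<tab>`, which GNU sort rejects with an error on stderr, leaving stdout empty. Separately, calling p1.communicate() first would drain the pipe that p2 is supposed to read from. Below is a minimal sketch that sidesteps both by passing argument lists directly. The function name `sort_skipping_header` is hypothetical, and forcing `LC_ALL=C` is my assumption to make the ordering predictable; the sort keys are the ones from the code above.

```python
import os
import subprocess


def sort_skipping_header(path):
    """Return the lines of `path` after the first, sorted by column 2
    (numeric) and then column 3, as bytes."""
    # Argument lists avoid shlex and shell quoting entirely. "\t" is a real
    # tab character here, so sort gets a single-character field separator.
    tail = subprocess.Popen(["tail", "-n", "+2", path], stdout=subprocess.PIPE)
    sort = subprocess.Popen(
        ["sort", "-t", "\t", "-k2,2n", "-k3,3"],
        stdin=tail.stdout,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # capture errors instead of losing them
        env=dict(os.environ, LC_ALL="C"),  # assumption: byte-wise collation
    )
    # Close our copy of the pipe so tail gets SIGPIPE if sort exits early.
    tail.stdout.close()
    out, err = sort.communicate()
    tail.wait()
    if sort.returncode != 0:
        raise RuntimeError(err.decode(errors="replace"))
    return out
```

Since communicate() reads the whole result into memory, for files that really are too large it is probably better to pass an open file object as stdout to the sort process instead of subprocess.PIPE, so the sorted rows go straight to disk.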