I've been trying to sort some very large CSVs using the command line sort, so that they are ready for processing in Python. I'm trying to use subprocess to just do this in Python, but can't get it to work. Here's the code:

import shlex
import subprocess
fn = 'path/to/filename'
p1 = subprocess.Popen(shlex.split('tail -n +2 {}'.format(fn)), stdout=subprocess.PIPE)
p2 = subprocess.Popen(shlex.split("sort -t$'\t' -k2,2n -k3,3"), stdin=p1.stdout, stdout=subprocess.PIPE)
output = p2.communicate()[0]
print(output)

When I print

p1.communicate()[0]

I get the bytestream of the file, as expected, but when I print

p2.communicate()[0]

I get an empty bytestream, and I can't figure out why.

As a side note, if there's a better way of sorting a CSV too large to fit in memory, then I'd love to hear about it.

Jeremy
  • unrelated: to make the shell pipeline more readable in Python, you could [use `plumbum`](http://stackoverflow.com/a/16709666/4279) – jfs Jul 16 '15 at 16:37

1 Answer

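The dollar sign in the -t flag of your sort command is extraneous. `$'...'` is shell quoting syntax, and no shell is involved here: shlex.split leaves the `$` in place, so sort is handed a two-character separator (`$` followed by a tab) instead of a single tab. GNU sort, for one, rejects a multi-character separator, which would explain the empty output. Remove the `$` and it should work: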

p2 = subprocess.Popen(shlex.split("sort -t'\t' -k2,2n -k3,3"), stdin=p1.stdout, stdout=subprocess.PIPE)
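
To see what sort is actually handed in each case, the two command strings can be run through shlex.split on their own. This is only an illustration of the tokenizing step; the variable names are illustrative and nothing here runs sort.

```python
import shlex

# Python turns "\t" into a real tab before shlex ever sees the string.
with_dollar = shlex.split("sort -t$'\t' -k2,2n -k3,3")
without_dollar = shlex.split("sort -t'\t' -k2,2n -k3,3")

# The "$" survives as a literal character, so the separator is "$" + tab.
print(with_dollar)     # ['sort', '-t$\t', '-k2,2n', '-k3,3']

# Without it, the separator is a single tab character.
print(without_dollar)  # ['sort', '-t\t', '-k2,2n', '-k3,3']
```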
Hai Vu
  • @Jeremy: removing `$` works by accident (because Python and /bin/sh interpret `\t` in the same way). In general, `$'...'` is interpreted by `/bin/sh` and you can't just drop it blindly; you should consider what it does in each case. – jfs Jul 16 '15 at 16:35
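
One way to sidestep the quoting question altogether is to skip shlex.split and give each program its argument list directly, so the tab is written once as an ordinary Python string. A minimal sketch, assuming POSIX `tail` and `sort` are on the PATH; `sort_tsv_body` is a hypothetical helper name, not something from the question.

```python
import subprocess


def sort_tsv_body(path):
    """Skip the header line of `path`, then sort the remaining rows
    by column 2 (numeric) and column 3, using tab as the separator."""
    tail = subprocess.Popen(["tail", "-n", "+2", path], stdout=subprocess.PIPE)
    sort = subprocess.Popen(
        ["sort", "-t", "\t", "-k2,2n", "-k3,3"],  # "\t" is a real tab here
        stdin=tail.stdout,
        stdout=subprocess.PIPE,
    )
    # Close our copy of the pipe so tail sees SIGPIPE if sort exits early.
    tail.stdout.close()
    output = sort.communicate()[0]
    tail.wait()
    return output
```

Note that `communicate()` collects all of the sorted output in memory. For a file too large for that, `stdout` can be an open file object instead of `subprocess.PIPE`.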