
I would like to process a file line by line. However, I need to sort it first, which I normally do by piping:

sort --key=1,2 data | ./script.py

What's the best way to call sort from within Python? Searching online, I see that subprocess or the sh module might be possibilities. I don't want to read the file into memory and sort it in Python, as the data is very big.

mata
Simd
  • FYI: sort has to read the file in memory – SheetJS Jul 23 '13 at 18:49
  • Not exactly. Linux sort is very clever about it and can sort massive files, even ones bigger than RAM, by using an external memory sorting algorithm. See http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file – Simd Jul 23 '13 at 18:51
  • Maybe sort does, maybe it doesn't have to read the file into memory, but then again, why does Python? (If `sort` can be clever, so can Python.) – kojiro Jul 23 '13 at 19:03
  • @kojiro Because there is no external memory sort module for Python, AFAIK. There is nothing stopping someone from writing one, of course. – Simd Jul 23 '13 at 19:05
  • @kojiro An external sort is already written for Linux, but you would have to write it yourself for Python. – RiaD Jul 23 '13 at 19:06
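Following up on the external-memory sorting point in the comments above, GNU sort exposes options for how it spills to disk; a minimal sketch (the flag values and output filename are illustrative, not from the question) driving it from Python:

import subprocess

# -S caps sort's in-memory buffer and -T chooses the directory for its
# temporary spill files; both are GNU sort options useful when the input
# is larger than RAM.
subprocess.check_call(
    ['sort', '--key=1,2', '-S', '512M', '-T', '/tmp',
     '--output=data.sorted', 'data'])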

3 Answers


It's easy. Use subprocess.Popen to run sort and read its stdout to get your data.

import subprocess

myfile = 'data'
# Run sort with its stdout connected to a pipe we can read from.
sort = subprocess.Popen(['sort', '--key=1,2', myfile],
                        stdout=subprocess.PIPE)
for line in sort.stdout:
    pass  # process each sorted line here
sort.wait()
assert sort.returncode == 0, 'sort failed'
tdelaney
  • What does sort.wait() do? It looks like sort would have to have finished before it got to that line in any case. – Simd Jul 23 '13 at 19:28
  • The for loop reads the process stdout until the process closes it (usually on process exit, but the process is free to do an early close if it wants to). The wait() call waits for process exit and gets its return code. This is important on Linux as it removes the zombie process from the system process table. – tdelaney Jul 23 '13 at 19:32
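As a side note to the exchange above, a minimal sketch of the same cleanup using Popen as a context manager (assuming Python 3.2+, where leaving the with-block closes the pipe and waits for the process):

import subprocess

# On exit from the with-block, Popen closes stdout and waits for sort,
# so no zombie is left in the process table.
with subprocess.Popen(['sort', '--key=1,2', 'data'],
                      stdout=subprocess.PIPE) as sort:
    for line in sort.stdout:
        pass  # process each sorted line here

if sort.returncode != 0:
    raise RuntimeError('sort failed')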

I believe sort will read all the data into memory, so I'm not sure you will gain anything, but you can use shell=True in subprocess and build a pipeline:

>>> subprocess.check_output("ls", shell = True)
'1\na\na.cpp\nA.java\na.php\nerase_no_module.cpp\nerase_no_module.cpp~\nWeatherSTADFork.cpp\n'
>>> subprocess.check_output("ls | grep j", shell = True)
'A.java\n'

Warning
Invoking the system shell with shell=True can be a security hazard if combined with untrusted input. See the warning under Frequently Used Arguments for details.
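Applied to the question's command, a rough sketch of this approach; note that check_output buffers the entire sorted output in memory, so for very large files the streaming Popen answer above is a better fit:

import subprocess

# The shell runs the sort; Python then iterates over the buffered result.
# Building the command string from untrusted input would be unsafe with shell=True.
output = subprocess.check_output("sort --key=1,2 data", shell=True)
for line in output.splitlines():
    pass  # process each sorted line here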

RiaD
  • Thanks. Sort will not read the whole file into memory. See http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file – Simd Jul 23 '13 at 19:01

I think this page will answer your question.

The answer I prefer, from @Eli Courtwright, is (all quoted verbatim):

Here's a summary of the ways to call external programs and the advantages and disadvantages of each:

  1. os.system("some_command with args") passes the command and arguments to your system's shell. This is nice because you can actually run multiple commands at once in this manner and set up pipes and input/output redirection. For example,
    os.system("some_command < input_file | another_command > output_file")
    However, while this is convenient, you have to manually handle the escaping of shell characters such as spaces, etc. On the other hand, this also lets you run commands which are simply shell commands and not actually external programs.
    http://docs.python.org/lib/os-process.html

  2. stream = os.popen("some_command with args") will do the same thing as os.system except that it gives you a file-like object that you can use to access standard input/output for that process. There are 3 other variants of popen that all handle the i/o slightly differently. If you pass everything as a string, then your command is passed to the shell; if you pass them as a list then you don't need to worry about escaping anything.
    http://docs.python.org/lib/os-newstreams.html

  3. The Popen class of the subprocess module. This is intended as a replacement for os.popen but has the downside of being slightly more complicated by virtue of being so comprehensive. For example, you'd say
    print Popen("echo Hello World", stdout=PIPE, shell=True).stdout.read()
    instead of
    print os.popen("echo Hello World").read()
    but it is nice to have all of the options there in one unified class instead of 4 different popen functions.
    http://docs.python.org/lib/node528.html

  4. The call function from the subprocess module. This is basically just like the Popen class and takes all of the same arguments, but it simply waits until the command completes and gives you the return code. For example:
    return_code = call("echo Hello World", shell=True)
    http://docs.python.org/lib/node529.html

  5. The os module also has all of the fork/exec/spawn functions that you'd have in a C program, but I don't recommend using them directly.

The subprocess module should probably be what you use.
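For the question's use case, a minimal sketch contrasting items 3 and 4 above (Popen when you want to stream the sorted lines, call when you only need the exit status); the input filename comes from the question, the output filename is made up:

from subprocess import PIPE, Popen, call

# Item 3: Popen exposes the pipe, so sorted lines can be read lazily.
proc = Popen(['sort', '--key=1,2', 'data'], stdout=PIPE)
for line in proc.stdout:
    pass  # process each sorted line here
proc.wait()

# Item 4: call just runs the command and returns its exit code.
return_code = call(['sort', '--key=1,2', '--output=data.sorted', 'data'])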

Aaron