
I have a Python script which has to call a certain app 3 times. These calls should run in parallel, since they take hours to complete and aren't dependent on each other. However, the script should halt until all of them are complete and then do some cleanup work.

Here is some code:

import subprocess

# do some stuff

for work in worklist:   # these should run in parallel
    output = open('test.txt', 'w')
    subprocess.call(work, stdout=output, stderr=output)
    output.close()

# wait for subprocesses to finish

# cleanup

So I basically want to run these commands in parallel while capturing their output to a file. Once all instances are done, I want to continue the script.

  • related: [Python: running subprocess in parallel](http://stackoverflow.com/q/9743838/4279), [Python threading multiple bash subprocesses](http://stackoverflow.com/q/14533458/4279), [Python: running subprocess in parallel](http://stackoverflow.com/q/16450788/4279). – jfs May 23 '14 at 22:11

2 Answers


subprocess.call() is blocking: each call waits for the child process to finish before the loop continues, so your three jobs end up running one after another.

What you want is to pass your arguments to the subprocess.Popen constructor instead. That way, each child process is started without blocking.

Later on, you can join these child processes together by calling Popen.communicate() or Popen.wait().

import io
import subprocess

child_processes = []
for work, filename in worklist:
    with io.open(filename, mode='wb') as out:
        # Popen starts the child and returns immediately; the child gets its
        # own copy of the file handle, so closing ours afterwards is safe
        p = subprocess.Popen(work, stdout=out, stderr=out)
        child_processes.append(p)

# now you can join them together
for cp in child_processes:
    cp.wait()    # blocks until this child process exits
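
For example, worklist would pair each command with its own output file; the command and file names below are made up:

worklist = [
    (['myapp', '--job', 'a'], 'job_a.log'),
    (['myapp', '--job', 'b'], 'job_b.log'),
    (['myapp', '--job', 'c'], 'job_c.log'),
]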

P.S. Have you looked into Python's documentation on the subprocess module?

Santa

I like to use GNU Parallel (http://www.gnu.org/software/parallel/) in situations like this (it requires *nix). It provides a quick way to get parallelism and has many options, including reorganizing the output at the end so that each process's output is kept together in order rather than interleaved. You can also specify how many jobs to run at once, either a specific number or one matching the number of cores you have, and it will queue up the rest of the commands.

Just use subprocess.check_output with shell=True to call out to parallel with your command string. If you've got a variable you want to interpolate, say a list of SQL tables you want to run your command against, parallel is good at handling that as well: you can pipe in the contents of a text file with the arguments.

If the commands are all totally different (as opposed to being variations on the same command), put the complete commands in the text file that you pipe into parallel.

You also don't need to do anything special to wait for them to finish, as the check_output call will block until the parallel command has finished.
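
For example, here is a minimal sketch of that approach. The file name commands.txt is an assumption (one complete shell command per line); -j 3 and -k are GNU Parallel flags that limit concurrency to three jobs and keep each job's output grouped in input order:

import subprocess

# commands.txt (hypothetical) holds one complete shell command per line.
# parallel runs each line as a job: -j 3 runs three at a time, and
# -k keeps each job's output grouped in input order rather than interleaved.
output = subprocess.check_output(
    'parallel -j 3 -k < commands.txt',
    shell=True,  # needed for the < redirection; only safe with trusted input
)

# check_output blocks until parallel (and therefore every job) has finished,
# so any cleanup can follow this call.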

khampson
  • `shell=True` is unsafe in almost any context. – jbarlow Dec 17 '15 at 22:21
  • There are [potential issues](https://docs.python.org/2/library/subprocess.html#frequently-used-arguments), but there are certainly cases where it is fine, e.g. when the input is *not* coming from arbitrary external sources. – khampson Dec 17 '15 at 22:40