
I am running a Python script that creates a list of commands to be executed by a compiled, proprietary program.

The program can split some of the calculations to run independently, and the data is then collected afterwards.

I would like to run these calculations in parallel, as each is a very time-consuming, single-threaded task and I have 16 cores available.

I am using subprocess to execute the commands (inside a class):

from subprocess import Popen, PIPE

def run_local(self):
    p = Popen(["someExecutable"], stdout=PIPE, stdin=PIPE)
    p.stdin.write(self.exec_string)
    p.stdin.flush()
    # poll() returns None while the process is still running
    while p.poll() is None:
        line = p.stdout.readline()
        self.log(line)

Where self.exec_string is a string of all the commands.

This string can be split into three parts: an initial part, the part I want parallelised, and a finishing part.

How should I go about this?

Also, it seems the executable will "hang" (waiting for a command, e.g. "exit", which releases its memory) if a naive copy-paste of the current method is used for each part.
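
To make this concrete, here is a rough sketch of that naive per-part approach (split_exec_string and the chunk contents are hypothetical, only meant to illustrate the structure):

from subprocess import Popen, PIPE
from threading import Thread

# hypothetical split of exec_string into the three parts
init_part, parallel_parts, finish_part = split_exec_string(exec_string)

def run_part(part):
    # naive copy of run_local: one fresh instance of the executable per part
    p = Popen(["someExecutable"], stdout=PIPE, stdin=PIPE)
    p.stdin.write(part)
    p.stdin.flush()
    # hangs here: this instance keeps waiting for further commands
    # (e.g. "exit") and never saw the initial part's commands
    while p.poll() is None:
        print(p.stdout.readline())

threads = [Thread(target=run_part, args=(part,)) for part in parallel_parts]
for t in threads:
    t.start()
for t in threads:
    t.join()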

Bonus: the executable can also run a bash script of commands, in case it is easier/possible to parallelise this in bash.

  • related: [Python threading multiple bash subprocesses?](http://stackoverflow.com/q/14533458/4279) – jfs Apr 19 '16 at 15:04
  • Definitely related, but it does not solve the partially parallelised p.stdout.readline(). – Andreas Gravgaard Andersen Apr 19 '16 at 16:50
  • here's [how to perform I/O concurrently](http://stackoverflow.com/q/23611396/4279). Though you've accepted a bash answer that doesn't read anything in parallel. – jfs Apr 20 '16 at 04:25

2 Answers

1

For bash, it could be very simple. Assuming your file looks like this:

## init part ##
ls
cd ..
ls
cat some_file.txt

## parallel ##
heavycalc &
heavycalc &
heavycalc &

## finish ##
wait
cat results.txt

By putting & after a command you tell bash to run it as a background process. wait then waits for all background processes to finish, so you can be sure all calculations are done.

I've assumed your input text file contains plain bash commands.
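
If it is easier to stay in the Python script than to maintain a separate bash file, the same pattern (start the heavy parts as background processes, then wait for all of them) can be sketched with subprocess; the command strings below are placeholders, and each chunk is assumed to end with "exit" so its instance terminates on its own:

from subprocess import Popen, PIPE

chunks = ["commands for job 1\nexit\n",   # placeholder command strings,
          "commands for job 2\nexit\n"]   # each ending with "exit"

# the bash "&": one instance of the executable per chunk, all started up front
procs = []
for chunk in chunks:
    p = Popen(["someExecutable"], stdin=PIPE, stdout=PIPE, universal_newlines=True)
    p.stdin.write(chunk)
    p.stdin.close()
    procs.append(p)

# the bash "wait": read each instance's output and block until it has finished
outputs = [p.stdout.read() for p in procs]
for p in procs:
    p.wait()

This assumes the chunks are independent and produce a modest amount of output; for heavy output the reads would need their own threads.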

SKR
  • With a bit of luck this might actually work with some piping of commands to the executable. I like the simplicity, so I might accept this as the answer, because the remaining problems might be too specific. – Andreas Gravgaard Andersen Apr 19 '16 at 11:59
  • Turns out piping doesn't really work. `echo "some command" | someExecutable &` The piped commands can't seem to access the pointers of the regular commands; they spawn a new instance of the executable instead. Do you have a good idea of how to pipe into some kind of shared-memory instance? – Andreas Gravgaard Andersen Apr 19 '16 at 12:24
  • You could try to make your command a subcommand and run the parent in the background: `eval "echo \"some command\" | someExecutable" &` – SKR Apr 19 '16 at 12:49
  • "spawn a new instance of the executable". What do you mean? With every call of `someExecutable &` the executable is called and, if not otherwise defined in the exec itself, spawned. What is your expected behaviour? – SKR Apr 19 '16 at 14:07
  • I expected (the program being disk-based enough) that every new thread would find the data from the first commands. However, this was not the case. I just figured out a way to write and read little enough data that multiprocessing is still a major advantage. `echo "some command" | someExecutable &` was the way to go! I appreciate the note about `wait` - I think it works perfectly now! – Andreas Gravgaard Andersen Apr 19 '16 at 16:49
1

Using GNU Parallel:

## init ##
cd foo
cp bar baz

## parallel ##
parallel heavycalc ::: file1 file2 file3 > results.txt

## finish ##
cat results.txt

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
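
Since the question drives everything from a Python script, the parallel step can also be launched from there. A sketch, reusing the hypothetical heavycalc/file names from the example above (-j 16 caps the number of simultaneous jobs at the 16 available cores; GNU Parallel defaults to one job per core):

import subprocess

# run the parallel middle step through GNU Parallel, one heavycalc per input file
with open("results.txt", "w") as out:
    subprocess.check_call(
        ["parallel", "-j", "16", "heavycalc", ":::", "file1", "file2", "file3"],
        stdout=out)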

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[image: Simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[image: GNU Parallel scheduling]

Installation

If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Ole Tange