
While my original problem was solved in a different manner (see the comment thread under this question, as well as the edits to this question), I was able to create a stack/LIFO for GNU Parallel in Bash. So I have edited my background/question to reflect a situation where it could be needed.

Background

I am using GNU Parallel to process files with a Bash script. As the files are processed, more files are created and new commands need to be added to parallel's list. I am not able to give parallel a complete list of commands, as information is generated as the initial files are processed.

I need a way to add the lines to parallel's list while it is running.

Parallel will also need to wait for a new line if nothing is in the queue and exit once the queue is finished.

Solution

First I created a fifo:

mkfifo /tmp/fifo
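(One small precaution of my own, not part of the original steps: mkfifo fails if /tmp/fifo is left over from a previous run, so a guard like this may help.)

[ -p /tmp/fifo ] || mkfifo /tmp/fifo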

Next I created a bash script that repeatedly cats the fifo and pipes the output to parallel, which checks for the end_of_file line. (I wrote this with help from the accepted answer as well as from here.)

#!/bin/bash
# Re-open and read the fifo forever, so the pipe feeding parallel never closes
# when a writer finishes. parallel runs each line it reads as a command.
while true; do
    cat /tmp/fifo
done | parallel --ungroup --gnu --eof "end_of_file" "{}"

Then I write to the pipe with this command, adding lines to parallel's queue:

echo "command here" > /tmp/fifo

With this setup, all new commands are added to the queue. Parallel will not begin processing until one job has been queued per job slot. This means that if you have slots for 32 jobs (32 processors), you will need to add 32 jobs before processing starts.

If parallel is occupying all of its processors, it will put the job on hold until a processor becomes available.

By using the --ungroup argument, parallel prints each job's output as soon as it is produced, so once the queue has filled and jobs start running, output appears immediately.

Without the --ungroup argument, parallel holds back a completed job's output until JobSlots more jobs have been started. From the accepted answer:

Output from the running or completed jobs are held back and will only be printed when JobSlots more jobs has been started (unless you use --ungroup or -u, in which case the output from the jobs are printed immediately). E.g. if you have 10 jobslots then the output from the first completed job will only be printed when job 11 has started, and the output of second completed job will only be printed when job 12 has started.
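One detail not shown above: based on how the --eof option is documented (not something verified in this exact setup), writing the end-of-file string to the fifo should make parallel ignore any further input and exit once the remaining jobs finish; the while true wrapper itself still has to be stopped separately, e.g. with kill:

echo "end_of_file" > /tmp/fifo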

Comments
  • When A is done, can B-E be run in parallel? Or is C dependent on B to complete? – Ole Tange Aug 25 '15 at 19:05
  • C depends on B, D on C and so forth. – Jake Aug 25 '15 at 19:15
  • If they depend, then I cannot see how you can use your CPUs better than making a function with A-E, and run that for each file - running one for each CPU in parallel. In what situations will your CPUs sit idle when they would not have to? – Ole Tange Aug 25 '15 at 19:27
  • I am manipulating 4 images, with 5 functions. One or two of the images process in 50% of the time of the others. Instead of dividing into 4 large/uneven chunks, I would prefer to divide it into 20 (4 images * 5 functions). This way they can be evenly distributed. I realize that the amount of time saved is very small, but I am working with very small margins. – Jake Aug 25 '15 at 19:40
  • But if the functions are dependent you cannot divide into 4*5 chunks: You can only run 4 functions (namely one for each image) in parallel. You can only divide into 4 groups - each containing the 5 functions. So even if you have 32 cores, you will still only be able to run 4 functions in parallel. If you have more files than you have cores, then try to sort the files, so the slow ones start first - this should give you the optimal runtime. – Ole Tange Aug 25 '15 at 19:54
  • @OleTange You are correct. I should have thought it through a little better. It doesn't matter which processor crunches the numbers, B has to wait for A. So the total time cannot be less than the largest A+B+C+D+E time. – Jake Aug 25 '15 at 20:09
  • ... and then suddenly the solution becomes trivially simple – Ole Tange Aug 25 '15 at 20:11

1 Answer


From http://www.gnu.org/software/parallel/man.html#EXAMPLE:-GNU-Parallel-as-queue-system-batch-manager

There is a small issue when using GNU parallel as queue system/batch manager: You have to submit JobSlot number of jobs before they will start, and after that you can submit one at a time, and job will start immediately if free slots are available. Output from the running or completed jobs are held back and will only be printed when JobSlots more jobs has been started (unless you use --ungroup or -u, in which case the output from the jobs are printed immediately). E.g. if you have 10 jobslots then the output from the first completed job will only be printed when job 11 has started, and the output of second completed job will only be printed when job 12 has started.

– Ole Tange