A possible approach (free of busy waiting and similar ugliness) is to track the number of jobs on the client side, cap their total count at 500 and, each time one of them finishes, immediately start a new one in its place. (This assumes, however, that the client script outlives the jobs.) Concrete steps:
1. Make the `qsub` tool block and (passively) wait for the completion of its remote job. Depending on the particular `qsub` implementation, it may have a `-sync` flag, or something more involved may be needed. (A possible sketch follows this list.)
2. Keep exactly 500 (no more and, if possible, no fewer) waiting instances of `qsub`. This can be automated by using this answer or this answer and setting `MAX_PARALLELISM` to `500` there. `qsub` itself would be started from the `do_something_and_maybe_fail()` function.
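For Grid Engine flavors of `qsub`, step 1 might look like the sketch below. (This is my assumption, not part of the linked answers: `-sync y` makes `qsub` block until the job finishes and exit with the job's exit status; other batch systems spell this differently, e.g. `-W block=true`, or need polling.)

```bash
# Hypothetical blocking wrapper for a Grid Engine style qsub.
# Assumes "-sync y" is supported; adjust for your batch system.
do_something_and_maybe_fail() {
  qsub -sync y "$@"
}
```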
Here’s a copy & paste of the Bash outline from the answers linked above, just to make this answer more self-contained. Starting with a trivial and runnable harness / dummy example (with a `sleep` instead of a `qsub -sync`):
```bash
#!/bin/bash
set -euo pipefail

declare -ir MAX_PARALLELISM=500  # pick a limit
declare -i pid
declare -a pids=()               # maps PID -> human-readable job name

do_something_and_maybe_fail() {
  ### qsub -sync "$@" ... ###    # add the real thing here
  sleep $((RANDOM % 10))         # remove this :-)
  return $((RANDOM % 2 * 5))     # remove this :-)
}

for pname in some_program_{a..j}{00..60}; do  # 600 items
  if ((${#pids[@]} >= MAX_PARALLELISM)); then
    # wait -n waits for the next job to finish; -p (Bash >= 5.1) stores its PID.
    wait -p pid -n \
      && echo "${pids[pid]} succeeded" 1>&2 \
      || echo "${pids[pid]} failed with ${?}" 1>&2
    unset 'pids[pid]'
  fi
  do_something_and_maybe_fail &  # forking here
  pids[$!]="${pname}"
  echo "${#pids[@]} running" 1>&2
done

# Drain the remaining jobs once all of them have been started.
for pid in "${!pids[@]}"; do
  wait -n "$((pid))" \
    && echo "${pids[pid]} succeeded" 1>&2 \
    || echo "${pids[pid]} failed with ${?}" 1>&2
done
```
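One caveat worth flagging: `wait -p` first appeared in Bash 5.1, so the outline will not run on older shells. If that matters, a guard like the following (my addition, not part of the original outline) fails fast with a clear message:

```bash
# Abort early on shells that predate wait -p (added in Bash 5.1).
if ((BASH_VERSINFO[0] < 5 || (BASH_VERSINFO[0] == 5 && BASH_VERSINFO[1] < 1))); then
  echo 'this script needs Bash >= 5.1 (for wait -p)' 1>&2
  exit 1
fi
```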
The first loop needs to be adjusted for the specific use case. An example follows, assuming that the right `do_something_and_maybe_fail()` implementation is in place and that `one_command_per_line.txt` is a list of arguments for `qsub`, one invocation per line, with an arbitrary number of lines. (The script could accept a file name as an argument or just read the commands from standard input, whichever works best.) The rest of the script would look exactly like the boilerplate above, keeping the number of parallel `qsub`s at `MAX_PARALLELISM` at most.
```bash
while read -ra job_args; do
  if ((${#pids[@]} >= MAX_PARALLELISM)); then
    wait -p pid -n \
      && echo "${pids[pid]} succeeded" 1>&2 \
      || echo "${pids[pid]} failed with ${?}" 1>&2
    unset 'pids[pid]'
  fi
  do_something_and_maybe_fail "${job_args[@]}" &  # forking here
  pids[$!]="${job_args[*]}"
  echo "${#pids[@]} running" 1>&2
done < /path/to/one_command_per_line.txt
```
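For illustration only, `one_command_per_line.txt` might contain entries like these (the flags are hypothetical Grid Engine style options; use whatever your `qsub` actually expects):

```
-b y -cwd ./some_program_a00
-b y -cwd ./some_program_a01
-b y -cwd ./some_program_a02
```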