
I want to start many independent tasks (job steps) as part of one job and want to keep track of the highest exit code of all these tasks.

Inspired by this question, I am currently doing something like:

#SBATCH stuff....

for i in {1..3}; do
    srun -n 1 ./myprog ${i} >& task${i}.log &
done

wait

in my jobs.sh, which I submit with sbatch, to start my tasks.

How can I define a variable exitcode which, after the wait command, contains the highest exit code of all the tasks?

Thanks so much in advance!

carstenbauer

2 Answers


You can store the jobs' PIDs in an array and wait for each one, like this:

#SBATCH stuff....

for i in {1..3}; do
    srun -n 1 ./myprog ${i} >& task${i}.log &
    pids+=($!)
done

exitcode=0
for pid in "${pids[@]}"; do
    wait "$pid"                     # wait's exit status is that task's exit status
    rc=$?
    exitcode=$(( rc > exitcode ? rc : exitcode ))
done

echo $exitcode
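
If you also want the job itself to be reported as failed whenever any step failed, a minimal follow-up (a sketch, assuming you want the batch script's exit status to propagate to Slurm) is to end jobs.sh with the collected code:

# propagate the highest per-task exit code as the job script's exit status
exit $exitcode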
oguz ismail

You can use GNU parallel to your advantage in such a case:

#SBATCH stuff....

parallel --joblog ./jobs.log -P 3 "srun -n1 --exclusive ./myprog {} >& task{}.log " ::: {1..3}

This will run srun ./myprog three times with arguments 1, 2 and 3 respectively, and redirect the output to three files named task1.log, task2.log and task3.log, just like your for-loop does.

With the --joblog option, it will also create a file jobs.log containing information about each run, including the exit code in column 7. You can then extract the maximum with:

awk 'NR>1 {print $7}' jobs.log | sort -n | tail -1 
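
If you want that maximum in a shell variable, as asked in the question, you can wrap the same pipeline in a command substitution; a minimal sketch:

# store the highest exit code reported in column 7 of jobs.log
exitcode=$(awk 'NR>1 {print $7}' jobs.log | sort -n | tail -1)
echo "highest exit code: ${exitcode}"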
damienfrancois