
I want to start many independent tasks (job steps) as part of one job and want to keep track of the highest exit code of all these tasks.

Inspired by this question, I am currently doing something like:

#SBATCH stuff....

for i in {1..3}; do
    srun -n 1 ./myprog ${i} >& task${i}.log &
done

wait

in my jobs.sh, which I submit with sbatch, to start my tasks.

How can I define a variable exitcode which, after the wait command, contains the highest exit code of all the tasks?

Thanks so much in advance!

carstenbauer

2 Answers


You can store the jobs' PIDs in an array and wait for each one, like this:

#SBATCH stuff....

for i in {1..3}; do
    srun -n 1 ./myprog ${i} >& task${i}.log &
    pids+=($!)
done

exitcode=0
for pid in "${pids[@]}"; do
    wait "$pid"                     # wait's exit status is that task's exit status
    rc=$?
    exitcode=$(( rc > exitcode ? rc : exitcode ))
done

echo $exitcode
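
If you also want the job itself to be reported as failed whenever any step failed, a minimal follow-up (a sketch, assuming you want the batch script's exit status to propagate to Slurm) is to end jobs.sh with the collected code:

# propagate the highest per-task exit code as the job script's exit status
exit $exitcode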
oguz ismail

You can use GNU parallel to your advantage in such a case:

#SBATCH stuff....

parallel --joblog ./jobs.log -P 3 "srun -n1 --exclusive ./myprog {} >& task{}.log " ::: {1..3}

This will run srun ./myprog three times with arguments 1, 2 and 3 respectively, and redirect the output to three files named task1.log, task2.log and task3.log, just like your for-loop does.

With the --joblog option, it will also create a file jobs.log containing information about each run, including the exit code in column 7. You can then extract the maximum with:

awk 'NR>1 {print $7}' jobs.log | sort -n | tail -1 
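
If you want that maximum in a shell variable, as asked in the question, you can wrap the same pipeline in a command substitution; a minimal sketch:

# store the highest exit code reported in column 7 of jobs.log
exitcode=$(awk 'NR>1 {print $7}' jobs.log | sort -n | tail -1)
echo "highest exit code: ${exitcode}"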
damienfrancois