
This question revisits an earlier question, Why is "except: pass" a bad programming practice?, as applied to the SLURM tool for parallel computation. I understand this may not be the best place to ask, but I would appreciate any advice.

With SLURM, I use the sbatch command to submit a script.sh file for parallel computation with different parameters. Omitting the unnecessary details, my script.sh essentially contains a single command: srun python3 py_script.py.

Finally, py_script.py contains the following infinite loop (the number of iterations is not known beforehand):

import sys

while True:
    try:
        <calculate condition>
        if condition == 0:
            print("Finishing.")
            sys.exit(0)
        <do stuff otherwise>

    except Exception as e:
        print("Encountered error: " + str(e))  # str() is needed: str + Exception raises TypeError
        sys.exit(-1)

    print('Calculating')
    <do calculations>

My problem is that when condition == 0 is met (I checked that this always happens), the script prints "Finishing." to the output file but apparently keeps running, as if sys.exit(0) were not processed by SLURM: the job still shows as running (R) in the squeue output. At the same time, I am sure the loop has stopped, because "Calculating" no longer appears in the log and the log is no longer updated. So it is not the issue described in sys.exit() not exiting in python. I thought it might be because I am inside the try... except... block, but the answers to the previous question Why is "except: pass" a bad programming practice? clearly indicate that Exception does not include things like SystemExit, so I should be able to exit.

What annoys me even more is that this behavior does not always occur, but only on some runs from the batch (all similar). It also did not seem to happen before the last update of the SLURM system, but of that I am not sure.
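The claim that except Exception does not swallow sys.exit() can be checked outside SLURM with a small sketch like the one below. The function names run_once and exit_code_of are made up for illustration; run_once only mimics the shape of one loop iteration in py_script.py, not the real calculation.

```python
import sys


def run_once(condition):
    """Hypothetical stand-in for one loop iteration of py_script.py."""
    try:
        if condition == 0:
            # sys.exit raises SystemExit, which derives from BaseException,
            # not from Exception.
            sys.exit(0)
        return "continued"
    except Exception as e:
        # Not reached for SystemExit; only ordinary errors land here.
        return "caught: " + str(e)


def exit_code_of(condition):
    """Return the exit code requested by run_once, or None if it did not exit."""
    try:
        run_once(condition)
    except SystemExit as exc:
        return exc.code
    return None


if __name__ == "__main__":
    print(issubclass(SystemExit, Exception))  # is SystemExit an Exception?
    print(exit_code_of(0))                    # exit code requested when condition == 0
    print(exit_code_of(1))                    # None means the iteration did not exit
```

If SystemExit propagates here as expected, the try/except block in the script is unlikely to be what keeps the job alive.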

Does anyone have any idea of what may be going on?

Could it be the same issue as in R parallel job hangs?

Dr_Zaszuś

1 Answer


By default, the SLURM configuration allows the remaining processes in a job to run to completion even if one process returns a non-zero exit code. Most probably slurm.conf (admin side) defines the setting KillOnBadExit=0.

You can override this behavior via srun (user side) by calling either srun -K=1 your_commands or srun --kill-on-bad-exit=1 your_commands.
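If the srun call is assembled from Python rather than typed by hand, the long-form option can be added as in the sketch below. The helper build_srun_command is hypothetical and only builds the argument list; it does not contact SLURM, and it assumes the job step is the python3 py_script.py command from the question.

```python
import shlex


def build_srun_command(user_command, kill_on_bad_exit=True):
    """Return the argv list for an srun call wrapping user_command.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    argv = ["srun"]
    if kill_on_bad_exit:
        # Long-form srun option named in the answer above.
        argv.append("--kill-on-bad-exit=1")
    return argv + list(user_command)


if __name__ == "__main__":
    command = build_srun_command(["python3", "py_script.py"])
    # Print the line that would go into script.sh.
    print(shlex.join(command))
```

The printed line can be pasted into script.sh in place of the plain srun call.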