This question revisits a previous question, "Why is "except: pass" a bad programming practice?", in application to SLURM, a tool for parallel computation. I understand this may not be the best place to ask, but I would appreciate any advice.
The way it works with SLURM is that I use the sbatch command to submit a script.sh file for parallel computation with different parameters. My script.sh can be simplified to basically containing a single command, srun python3 py_script.py; I have omitted the unnecessary details.
Finally, py_script.py contains the following infinite loop (the number of runs is not known beforehand):
import sys

while True:
    try:
        <calculate condition>
        if condition == 0:
            print("Finishing.")
            sys.exit(0)
        <do stuff otherwise>
    except Exception as e:
        print("Encountered error: " + str(e))
        sys.exit(-1)
    print('Calculating')
    <do calculations>
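For what it's worth, outside SLURM a plain Python process that calls sys.exit(0) terminates with status 0, as expected. A quick local check along these lines (not part of my job, just an illustration):

import subprocess
import sys

# Run a child Python process that calls sys.exit(0) and confirm it
# terminates with exit status 0.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(0)"])
print("child exit status:", proc.returncode)  # prints 0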
My problem is that when condition == 0 is met (I checked that it always happens), the script prints "Finishing." to the output file but apparently continues running, as if sys.exit(0) was not processed by SLURM: I still see the job in the running state (R) in the squeue output. At the same time, I'm sure the loop has stopped, because "Calculating" is no longer printed and the log is not updated anymore.
So it's not the issue mentioned in sys.exit() not exiting in python.
I thought it might be because I'm inside the try... except... block, but the answers to the previous question, "Why is "except: pass" a bad programming practice?", clearly indicate that Exception does not include things like SystemExit, so I should be able to exit.
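To illustrate what I mean: sys.exit() raises SystemExit, which derives from BaseException rather than Exception, so an except Exception handler should let it propagate. A minimal standalone example:

import sys

# SystemExit is not a subclass of Exception, so the handler below is
# never entered and the interpreter exits with status 0.
try:
    sys.exit(0)
except Exception as e:
    print("caught:", e)  # not reached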
What annoys me even more is that this behavior does not always occur, but only on some runs from the batch (which are all similar). It also did not seem to happen before the last update of the SLURM system, but I am not sure about that.
Does anyone have any idea of what may be going on?
Could it be the same issue as in R parallel job hangs?