I have a Slurm job array for which the job file includes a --requeue directive. Here is the full job file:

#!/bin/bash
#SBATCH --job-name=catsss
#SBATCH --output=logs/cats.log
#SBATCH --array=1-10000
#SBATCH --requeue
#SBATCH --partition=scavenge
#SBATCH --mem=32g
#SBATCH --time=24:00:00
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=douglas.duhaime@gmail.com
module load Langs/Python/3.4.3
python3 cats.py ${SLURM_ARRAY_TASK_ID} 'cats'

Several of the array tasks have restarted at least once. How many times will these jobs restart before the scheduler finally cancels them? Will the restarts carry on indefinitely until a sysadmin manually cancels them, or do jobs like this have a maximum number of retries?

duhaime

1 Answer

AFAIK, there is no limit on how many times a job can be requeued. You only decide whether the job may be requeued or not. With --no-requeue, it will never be requeued. With --requeue, it will be requeued every time the system decides it is needed (node failure, preemption by a higher-priority job...).

The jobs keep restarting until they finish (successfully or not, but finished rather than interrupted).
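
If you want a ceiling on restarts anyway, one option is to enforce it inside the job script. The sketch below assumes your Slurm version exports SLURM_RESTART_COUNT to requeued jobs (the sbatch documentation lists it, but check your cluster). MAX_RESTARTS and within_restart_cap are hypothetical names of my own; Slurm does not read them.

```shell
#!/bin/bash
# MAX_RESTARTS is a hypothetical, user-chosen cap; Slurm itself ignores it.
MAX_RESTARTS=${MAX_RESTARTS:-3}

# Return 0 (true) while the task is still within the cap.
# SLURM_RESTART_COUNT is only set once a job has been restarted,
# so an unset value is treated as 0.
within_restart_cap() {
    local restarts=${SLURM_RESTART_COUNT:-0}
    [ "$restarts" -le "$MAX_RESTARTS" ]
}

if ! within_restart_cap; then
    echo "Task ${SLURM_ARRAY_TASK_ID:-?} restarted ${SLURM_RESTART_COUNT} times; giving up." >&2
    exit 1
fi

# The real work would follow here, e.g. the python3 command from the job file.
```

Exiting with a non-zero status makes the task end in a finished state instead of being interrupted, so it should not be requeued again.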

Poshi
  • So if the process fails due to time expiration, for instance, it will just continue to restart forever? – duhaime Jul 21 '18 at 11:46
  • No, that's a proper stop of the program. The expectation is that restarting the same script with the same parameters would fail again, so it is not requeued. It is just killed and marked as ended due to the time limit (state TIMEOUT). – Poshi Jul 21 '18 at 12:14
  • Requeueing mostly occurs when requested by a sysadmin (after a scheduled downtime), due to node failure, or when the job is preempted to let a higher-priority job start. – Poshi Jul 21 '18 at 12:17
  • If your time limit or memory request is not properly adjusted, or your script simply fails, the job is completed and ends up in some finished state: FAILED, COMPLETED... – Poshi Jul 21 '18 at 12:19
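
To see how often a given task has actually been requeued, the restart counter can be pulled out of the scheduler's job record. This is a minimal sketch, assuming `scontrol show job` prints a `Restarts=N` field as it does on the versions I have seen; restart_count is a hypothetical helper name.

```shell
# Print the restart count found in `scontrol show job` output read from stdin.
# Assumes the output contains a field of the form Restarts=N.
restart_count() {
    grep -o 'Restarts=[0-9]*' | head -n 1 | cut -d= -f2
}

# Usage on a cluster (the job ID is a placeholder):
#   scontrol show job <jobid> | restart_count
# The final state of a finished task can be checked with:
#   sacct -j <jobid> --format=JobID,State
```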