
I am running a job array with SLURM, using the following job array script (which I submit with `sbatch job_array_script.sh [args]`):

#!/bin/bash

#SBATCH ... other options ...

#SBATCH --array=0-1000%200

srun ./job_slurm_script.py $1 $2 $3 $4

echo 'open' > status_file.txt

To explain: I want job_slurm_script.py to run as a job array of 1001 tasks (indices 0 to 1000), with at most 200 running in parallel. When all of them are done, I want to write 'open' to status_file.txt. This is because in reality I have more than 10,000 jobs, which is above my cluster's MaxSubmissionLimit, so I need to split them into smaller chunks (1000-element job arrays) and run the chunks one after another, each only when the previous one has finished.

However, for this to work, the echo statement must only trigger once the entire job array is finished (outside of this script, I have a loop that checks status_file.txt to see whether the job is finished, i.e. whether its contents are the string 'open').
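The outer driver loop that checks status_file.txt is essentially the following sketch (the chunk count, polling interval, and argument names here are illustrative placeholders, not my exact script):

```shell
#!/bin/bash
# Illustrative driver: submit one 1000-element chunk at a time,
# polling status_file.txt until the running chunk writes 'open'.
echo 'open' > status_file.txt
for chunk in $(seq 0 10); do
    # wait until the previous chunk reports that it has finished
    until [ "$(cat status_file.txt 2>/dev/null)" = "open" ]; do
        sleep 60
    done
    echo 'closed' > status_file.txt
    sbatch job_array_script.sh arg1 arg2 arg3 "$chunk"
done
```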

Up to now I assumed that srun holds the script up until the whole job array is finished. However, sometimes srun "returns" and the script reaches the echo statement before the jobs are finished, so all the subsequent submissions bounce off the cluster because they exceed the submission limit.

So how do I make srun "hold up" until the whole job array is finished?

Marses

3 Answers


You can add the --wait flag to sbatch. With --wait, sbatch does not exit until the submitted job terminates, and for a job array that means until all of the array tasks have finished.

Check the manual page of sbatch for details on --wait.
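A minimal sketch of the submitting side, assuming the script name and arguments from the question (note that the echo then belongs here, in the submitting shell, rather than inside the batch script, where every array task would execute it):

```shell
# sbatch --wait does not exit until the submitted job terminates;
# for a job array, that is only after all array tasks have finished.
sbatch --wait job_array_script.sh arg1 arg2 arg3 arg4

# This line is therefore only reached once the whole chunk is done.
echo 'open' > status_file.txt
```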

vsocrates
  • This is not a helpful answer, you just link to the general documentation as opposed to `sbatch` specifically: https://slurm.schedmd.com/sbatch.html – Dylan Madisetti Apr 20 '21 at 18:37
  • @DylanMadisetti you can [suggest an edit](https://stackoverflow.com/posts/49509245/edit) to improve posts! – ti7 Apr 20 '21 at 18:37

You can use the --wait option of sbatch (short form -W) in combination with bash's wait builtin to send jobs off to the cluster, pause script execution until they are complete, and then continue. E.g.

#!/bin/bash
set -e
date

for ((i = 0; i < 5; i++)); do
    # -W (--wait) makes sbatch block until the submitted job finishes;
    # the trailing & backgrounds sbatch so all five submissions overlap.
    # --array is needed for $SLURM_ARRAY_TASK_ID to be set on the node.
    sbatch -W --array=0-3 --wrap='echo "hello from $SLURM_ARRAY_TASK_ID"; sleep 10' &
done
wait   # block until every backgrounded sbatch has returned

date
echo "I am finished"
irritable_phd_syndrome

You can use the bash wait command. It will wait until the lines of code above it have finished. Thus your script should look like this:

#!/bin/bash

#SBATCH ... other options ...

#SBATCH --array=0-1000%200

srun ./job_slurm_script.py $1 $2 $3 $4

wait

echo 'open' > status_file.txt
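As an aside, bash's wait only waits for jobs that were started in the background with &; for a foreground command like the srun above it returns immediately. A standalone illustration, with no SLURM involved:

```shell
#!/bin/bash
# bash's wait blocks only on jobs started in the background with '&';
# with no background jobs running, wait is effectively a no-op.
(exit 3) &    # start a background subshell that exits with status 3
pid=$!
wait "$pid"   # blocks until that background job finishes
echo "background job exited with $?"
```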
rmdcoding
  • This doesn't seem to work and gives the same problem as before. – Marses Oct 03 '17 at 08:16
  • What version of SLURM are you running, and on what kind of system? – rmdcoding Oct 03 '17 at 15:31
  • slurm 17.02.7. Also, what do you mean by system? From what I've seen, srun doesn't *immediately* skip past onto the next command. Usually what seems to happen is that srun holds/waits for quite a while, but then something happens to make it skip past. I'm not sure what, although one thing I suspect is that this happens when all the array job tasks are pending. – Marses Oct 05 '17 at 09:20