
I have an executable that takes multiple options and multiple file inputs in order to run. The executable can be called with a variable number of cores.

E.g. executable -a -b -c -file fileA --file fileB ... --file fileZ --cores X

I'm trying to create an sbatch file that will enable me to have multiple calls of this executable with different inputs. Each call should be allocated to a different node (in parallel with the rest), using X cores. The parallelization at the core level is taken care of by the executable, while at the node level it is handled by SLURM.

I tried with ntasks and multiple sruns but the first srun was called multiple times.

Another approach was to rename the files and use a SLURM process or node number in the filename before the extension, but it's not really practical.

Any insight on this?

IVy

4 Answers


I always do these kinds of jobs with the help of a bash script that I submit via sbatch. The easiest approach would be to have a loop in the sbatch script where you spawn the different jobs and job steps for your executable with srun, specifying e.g. the corresponding node name in your partition with -w. You may also read up on the documentation of SLURM array jobs (if that suits you better). Alternatively, you could store all parameter combinations in a file and then loop over them in the script, or have a look at the "array job" manual page.

Maybe the following script (I just wrapped it up) helps you get a feeling for what I have in mind (I hope it's what you need). It's not tested, so don't just copy and paste it!

#!/bin/bash

parameters=(10 5 2)
node_names=(node1 node2 node3)

# let's run one job step per node, each taking one parameter

for parameter in "${parameters[@]}"
do
    # assign the job step to the first node in the list
    # -w specifies the name of the node to use
    # -N specifies the number of nodes
    node=${node_names[0]}
    JOBNAME="myjob$node-$parameter"
    # delete the first node from the list
    unset 'node_names[0]'
    # reindex the list
    node_names=("${node_names[@]}")
    srun -N1 -w"$node" -psomepartition -J"$JOBNAME" executable.sh "$parameter" &
done

You will have the problem that you need to force your sbatch script to wait for the last job step. In this case the following additional while loop might help you.

# Wait for the last job step to complete
while true;
do
    # wait for the last job step to finish; use the state from sacct for that
    echo "waiting for last job step to finish"
    sleep 10
    # sacct shows your jobs; -s R,PD filters running and pending steps
    sacct -s R,PD | grep "myjob"   # your job name indicator
    # check the status code of grep (1 if nothing found)
    if [ "$?" == "1" ];
    then
        echo "found no running jobs anymore"
        sacct -s R | grep "myjob"
        echo "stopping loop"
        break;
    fi
done;
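
If a job array fits your problem better, a minimal sketch of that approach could look like the following (untested; the array range, partition name, and the way array indices map to parameters are assumptions you would adapt):

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=somepartition
#SBATCH --array=0-2        # one array task per parameter
#SBATCH --nodes=1          # each array task runs on its own node
#SBATCH --ntasks=1

# map the array index to a parameter (assumed parameter list)
parameters=(10 5 2)
parameter=${parameters[$SLURM_ARRAY_TASK_ID]}

srun executable.sh "$parameter"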
PlagTag
  • sacct -s R,gPD is not recognized. Is it a typo or another version? I've removed ",gPD" but the script does not end and keeps all nodes occupied. – IVy Sep 22 '15 at 15:46
  • Yes, you need to find a grep expression that finds the pending job steps (or similar) of your running job. – PlagTag Sep 23 '15 at 08:26
  • Doesn't a simple wait after the srun commands suffice? As shown here http://geco.mines.edu/scripts/notes.pdf - page 62? – IVy Sep 23 '15 at 11:48
  • @IVy, good question! I remember that I used it a long time ago, but it should be easy to test. I will test that next time I am writing a batch job. Btw, I think it could be worth looking into array jobs here as well. – PlagTag Sep 23 '15 at 12:37

I managed to find one possible solution, so I'm posting it for reference:

I declared as many tasks as calls to the executable, as many nodes, and the desired number of CPUs per call.

Then there is a separate srun for each call, declaring the number of nodes and tasks for each call. All the sruns are joined with ampersands (&):

srun -n 1 -N 1 --exclusive executable -a1 -b1 -c1 -file fileA1 --file fileB1 ... --file fileZ1 --cores X1 &

srun -n 1 -N 1 --exclusive executable -a2 -b2 -c2 -file fileA2 --file fileB2 ... --file fileZ2 --cores X2 &

....

srun -n 1 -N 1 --exclusive executable -aN -bN -cN -file fileAN --file fileBN ... --file fileZN --cores XN

--Edit: After some tests (as I mentioned in a comment below), if the process of the last srun ends before the rest, it seems to end the whole job, leaving the rest unfinished.

--edited based on the comment by Carles Fenoy
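
Putting this together with the `wait` suggested in the comments below, a sketch of the whole batch script could look like this (untested; the task/node counts, file names, and X are placeholders to adapt):

#!/bin/bash
#SBATCH --ntasks=3           # one task per call of the executable
#SBATCH --nodes=3            # one node per call
#SBATCH --cpus-per-task=X    # X = cores used by each call

srun -n 1 -N 1 --exclusive executable -a1 -b1 -c1 --file fileA1 --cores X &
srun -n 1 -N 1 --exclusive executable -a2 -b2 -c2 --file fileA2 --cores X &
srun -n 1 -N 1 --exclusive executable -a3 -b3 -c3 --file fileA3 --cores X &

# wait for all backgrounded job steps before the job script exits
wait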

IVy
  • You can use [GNU Parallel](https://www.gnu.org/software/parallel/) along with srun to ease the generation of the command arguments. – damienfrancois Sep 02 '15 at 19:42
  • I have a problem with the last srun. If it is the first to end, it kills all the remaining processes. Even if I add nokill and -k to each srun call. Any ideas? – IVy Sep 22 '15 at 15:44
  • 1
    @IVy you can use `wait` as the last command – akraf Jun 30 '21 at 09:19

Write a bash script to populate multiple xyz.slurm files and submit each of them using sbatch. The following script uses a nested for loop to create 8 template files (one per thread/chunk combination), then iterates over the modes to substitute a placeholder string and submits the resulting files with sbatch. You might need to modify the script to suit your needs.

#!/usr/bin/env bash
# Path where you want to create the slurm files
slurmpath=~/Desktop/slurms
rm -rf $slurmpath
mkdir -p $slurmpath/sbatchop
mkdir -p /exports/home/schatterjee/reports
echo "Folders /slurms and /reports created"

declare -a threads=("1" "2" "4" "8")
declare -a chunks=("1000" "32000")
declare -a modes=("server" "client")

## now loop through the above arrays
for i in "${threads[@]}"
do
    for j in "${chunks[@]}"
    do
# the following is the content of each slurm template file
cat <<EOF >$slurmpath/net-$i-$j.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=$slurmpath/sbatchop/net-$i-$j.out
#SBATCH --wait-all-nodes=1
echo \$SLURM_JOB_NODELIST

cd /exports/home/schatterjee/cs553-pa1

srun ./MyNETBench-TCP placeholder1 $i $j
EOF
        # now schedule them
        for m in "${modes[@]}"
        do
            # replace placeholder1 with the value of m in a per-mode copy,
            # so both modes get submitted (sed -i on the same file would
            # only substitute the first mode)
            sed -e 's/placeholder1/'"$m"'/g' $slurmpath/net-$i-$j.slurm > $slurmpath/net-$i-$j-$m.slurm
            # submit each configuration 5 times
            for value in {1..5}
            do
                sbatch $slurmpath/net-$i-$j-$m.slurm
            done
        done
    done
done
sapy

You can also try this Python wrapper, which can execute your command on the files you provide.