
I was wondering if I could ask something about running Slurm jobs in parallel. (Please note that I am new to Slurm and Linux and have only started using them two days ago...)

As per the instructions in the picture below (source: https://hpc.nmsu.edu/discovery/slurm/serial-parallel-jobs/),

[Image: instructions on how to run a parallel job]

I have written the following bash script:

#!/bin/bash

#SBATCH --job-name fmriGLM # to give the job a different name
#SBATCH --nodes=1
#SBATCH -t 16:00:00 # Time for running job
#SBATCH -o /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/output_fmri_glm.o%j # %j is replaced by the job id
#SBATCH -e /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/error_fmri_glm.e%j
pwd; hostname; date
#SBATCH --ntasks=30
#SBATCH --mem-per-cpu=3000MB
#SBATCH --cpus-per-task=1


for num in {0..29}
do
    srun --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done

wait

Then, I ran sbatch as follows: `sbatch test_bash`

However, when I view the outputs, it is apparent that only one of the sruns in the bash script is being executed... Could anyone tell me where I went wrong and how I can fix it?

**Update: when I look at the error file I get the following: `srun: Job 43969 step creation temporarily disabled, retrying`. I searched the internet and it says that this could be caused by not specifying the memory and hence not having enough memory for the second job... but I thought that I had already specified the memory when I set `--mem-per-cpu=3000MB`?

**Update: I have tried changing the code as suggested in "Why are my slurm job steps not launching in parallel?", but it still didn't work.

**Potentially pertinent information: our node has about 96 cores, which seems odd compared to tutorials that say one node has something like 4 cores.

Thank you!!

Danny Han

2 Answers


Try adding --exclusive to the srun command line:

srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &

This will instruct srun to use a sub-allocation and work as you intended.

Note that the --exclusive option has a different meaning in this context than if used with sbatch.

Note also that different versions of Slurm each have their own canonical way of doing this, but using --exclusive should work across most versions.
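
For clarity, here is a minimal sketch of the loop from the question with only this change applied (everything else in the submission script left as in the question):

for num in {0..29}
do
    # --exclusive here applies to the job step: each srun gets a dedicated
    # sub-allocation, so the 30 background steps can run side by side.
    srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done

wait    # block until all background steps have finished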

damienfrancois
  • Thank you for your answer! Actually, I already tried what you have said (as written in the second update to my question), but it still seems to give me the same results... – Danny Han Apr 22 '22 at 14:24
  • (it also gives the same error, `srun: Job 43972 step creation temporarily disabled, retrying`) – Danny Han Apr 22 '22 at 14:25
  • maybe try `-c1 -n1 -N1 --exclusive` ? – damienfrancois Apr 22 '22 at 14:47
  • thank you! unfortunately, it still gives out the same error and only one task seems to be executed – Danny Han Apr 22 '22 at 14:54
  • Since the ntasks method doesn't seem to work, are there other ways to run in parallel? I initially tried submitting multiple jobs (with one task per job), but found that resulted in the jobs being run serially… or should I try a job array? (I haven't studied it yet, but at first glance it seemed like a job array would let the code run in parallel) – Danny Han Apr 23 '22 at 03:13
  • I solved it! Turns out the `pwd; hostname; date` line of the script caused the problem: Slurm stops reading #SBATCH directives at the first non-comment command, so the directives placed after that line were silently ignored (https://stackoverflow.com/questions/71978596/slurm-job-arrays-dont-work-when-used-in-argparse/71979187#71979187). A corrected layout is sketched below. – Danny Han Apr 23 '22 at 13:57
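
For readers who hit the same symptom, here is a minimal sketch of the corrected layout, with all #SBATCH directives placed before the first command. The long /scratch paths from the question are shortened here, and the --exclusive flag from the answer above is kept:

#!/bin/bash
#SBATCH --job-name=fmriGLM
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3000MB
#SBATCH -t 16:00:00
#SBATCH -o output_fmri_glm.o%j    # %j is replaced by the job id
#SBATCH -e error_fmri_glm.e%j

# Executable commands must come after the directives above: Slurm stops
# parsing #SBATCH lines at the first non-comment command.
pwd; hostname; date

for num in {0..29}
do
    srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done

wait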

Even though you have solved your problem, which turned out to be something else, and you had already specified --mem-per-cpu=3000MB in your sbatch script, I would like to add that in my case my Slurm setup does not allow --mem-per-cpu in sbatch, only --mem. So the srun command will still allocate all of the memory and block the subsequent steps. The key for me is to specify --mem-per-cpu (or --mem) in the srun command.
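
As an illustration only (not the poster's exact script), this is what the loop looks like with the memory request moved onto srun; the 3000MB figure simply mirrors the question's --mem-per-cpu setting and should be adjusted for your cluster:

for num in {0..29}
do
    # Requesting memory per step prevents the first step from claiming the
    # whole job allocation and blocking the remaining steps.
    srun --ntasks=1 --mem-per-cpu=3000MB python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done

wait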

Isabella