
I'm new to using MPI (mpi4py) and Slurm. I need to run about 50000 tasks, so to obey the administrator-set limit of about 1000, I've been running them like this:

sbrunner.sh:

#!/bin/bash
for i in {1..50}
do
   sbatch m2slurm.sh $i
   sleep 0.1
done

m2slurm.sh:

#!/bin/bash
#SBATCH --job-name=mpi
#SBATCH --output=mpi_50000.out
#SBATCH --time=0:10:00
#SBATCH --ntasks=1000

srun --mpi=pmi2 --output=mpi_50k${1}.out python par.py data_50000.pkl ${1} > ${1}py.out 2> ${1}.err

par.py (irrelevant stuff omitted):

import sys
from mpi4py import MPI

# (loading of `data` from data_50000.pkl, passed as sys.argv[1], is omitted here)
offset = (int(sys.argv[2]) - 1) * 1000   # each job handles a block of 1000 items
comm = MPI.COMM_WORLD
k = comm.Get_rank()                      # rank within this job: 0..999
d = data[k + offset]

# ... do something with d ...

allresults = comm.gather(result, root=0)   # collect every rank's result on rank 0
comm.Barrier()
if k == 0:
    print(allresults)

  1. Is this a sensible way to get around the limit of 1000 tasks?
  2. Is there a better way to consolidate results? I now have 50 files I have to concatenate manually. Is there some comm_world that can exist between different jobs?
pyg
  • To your question #2, how about appending one more line of script after the for loop to combine all the files together? For example: https://stackoverflow.com/questions/2150614/bash-shell-scripting-combining-txt-into-one-file – Ryan L Jul 08 '17 at 23:10
  • @RyanL, what I'm doing now is akin to this. I'm wondering if there's a more elegant way to handle this. MPI's comm system seems to exist to save the need for this kind of concatenation. But at this point I'm still just doing it myself. – pyg Jul 08 '17 at 23:35
  • I see. Sorry for misunderstanding. Does the order of the files during concatenation matter? If so, it might be more advisable to handle the consolidation after the for loop, if I understand your response correctly. – Ryan L Jul 08 '17 at 23:46
  • No problem. Concatenation isn't the problem. I have a separate script right now that concatenates and does post-processing of the results that are printed. However, if I want to automate that, I'd have to run that after the for-loop and poll regularly to see if all of the jobs are completed. This seems like kind of a hacky way to do it and I'm wondering if there's a better way. – pyg Jul 09 '17 at 00:00
  • **Consult your system administrator!** The limit is probably set for a reason, and you should respect that. In particular, submitting large numbers of jobs in short bursts can degrade overall batch system responsiveness and even lead to timeouts and errors. – Zulan Jul 09 '17 at 06:38

3 Answers


I think you need to make your application divide the work among 1000 tasks (MPI ranks) and then consolidate the results with MPI collective calls, e.g. MPI_Reduce or MPI_Allreduce.
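
For illustration, here is a minimal sketch of that approach (the even-chunking scheme, the pickle loading, the do_something placeholder, and the output file name are assumptions, not code from the question): each rank of a single 1000-task job processes 50 items, and rank 0 gathers everything and writes one consolidated file.

# sketch: one 1000-task job covering all 50000 items
import pickle
import sys
from mpi4py import MPI

def do_something(d):
    return d                            # stand-in for the real per-item computation

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()                  # 1000 ranks

with open(sys.argv[1], "rb") as f:      # e.g. data_50000.pkl
    data = pickle.load(f)

chunk = len(data) // size               # 50 items per rank (assumes an even split)
local = data[rank * chunk:(rank + 1) * chunk]
local_results = [do_something(d) for d in local]

gathered = comm.gather(local_results, root=0)   # list of per-rank lists on rank 0
if rank == 0:
    allresults = [r for part in gathered for r in part]
    with open("results_50000.pkl", "wb") as f:
        pickle.dump(allresults, f)      # one output file instead of 50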

Trying to work around the limit won't help you, as the jobs you start will just be queued one after another.

Job arrays will give behavior similar to what you did with the batch file you provided, so your application must still be able to process all the data items given only N tasks (MPI ranks).

There is no need to poll to make sure all the other jobs are finished; take a look at the Slurm job dependency parameter: https://hpc.nih.gov/docs/job_dependencies.html

Edit:

You can use a job dependency to create a new job that runs after all the other jobs finish; this job collects the results and merges them into one big file. I still believe you are overthinking the obvious solution: make rank 0 (the master) collect all the results and save them to disk.
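
As a rough sketch of the dependency idea (keeping the sbrunner-style loop; merge.sh is a hypothetical post-processing script, not something from the question), you could record the job IDs that sbatch --parsable prints and submit one extra job that only starts after all of them complete successfully:

# sketch: submit the 50 compute jobs, then a merge job gated on their success
import subprocess

job_ids = []
for i in range(1, 51):
    # --parsable makes sbatch print only the job id
    out = subprocess.run(["sbatch", "--parsable", "m2slurm.sh", str(i)],
                         capture_output=True, text=True, check=True)
    job_ids.append(out.stdout.strip().split(";")[0])

# afterok: start only once every listed job has finished with exit code 0
dependency = "afterok:" + ":".join(job_ids)
subprocess.run(["sbatch", "--dependency=" + dependency, "merge.sh"], check=True)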

Mahmoud Fayez
  • I think job dependencies are the way to go. Thank you! Not sure I understand completely about the obvious solution. I used rank 0 to consolidate (as shown in par.py). Do you mean to say the better way to go about this is to parameterize the function so that it takes 50000/1000 = 50 "calls" in serial assigned to each task? – pyg Jul 09 '17 at 17:52
  • Sorry, I was traveling and didn't have a chance to check my SO. Yes, and then you can parallelize it with OpenMP later on to fully utilize the multiple cores available on each node. – Mahmoud Fayez Jul 14 '17 at 05:36

This looks like a perfect candidate for job arrays. Each job in an array is identical except for the $SLURM_ARRAY_TASK_ID environment variable. You can use this in the same way that you are using the command-line variable.

(You'll need to check that MaxArraySize is set high enough by your sysadmin. Check the output of scontrol show config | grep MaxArraySize )
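
For example, with something like #SBATCH --array=1-50 in the job script, the offset computation in par.py could read the array index from the environment instead of sys.argv[2] (a sketch, assuming the same 1000-items-per-job layout as in the question):

# sketch: derive the data offset from the Slurm array task ID (1..50)
import os
from mpi4py import MPI

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
offset = (task_id - 1) * 1000

comm = MPI.COMM_WORLD
k = comm.Get_rank()
# d = data[k + offset]                  # ...then proceed exactly as before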

ciaron
  • That command is definitely helpful. Still wondering if there is a way to reduce among the entire job array. I'll play around with it and report back. – pyg Jul 09 '17 at 17:46

What do you mean by 50000 tasks?

  1. Do you mean one MPI job with 50000 MPI tasks?
  2. Or do you mean 50000 independent serial programs?
  3. Or do you mean any combination where (number of MPI jobs) * (number of tasks per job) = 50000?

If 1), well, consult your system administrator. Of course you can allocate 50000 slots across multiple SLURM jobs, manually wait until they are all running at the same time, and then mpirun your app outside of SLURM. This is both ugly and inefficient, and you might get into trouble if it is seen as an attempt to circumvent the system limits.

If 2) or 3), then a job array is a good candidate. And if I understand your app correctly, you will need an extra post-processing step in order to concatenate/merge all your outputs into a single file.

And if you go with 3), you will need to find the sweet spot: generally speaking, 50000 serial programs are more efficient than fifty 1000-task MPI jobs or one 50000-task MPI program, but merging 50000 files is less efficient than merging 50 files (or not merging anything at all).

And if you cannot do the post-processing on a frontend node, then you can use a job dependency to start it after all the computations have completed.
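
As a rough illustration of that post-processing step (the file names follow the mpi_50k<i>.out pattern from the question, and the merged file name is just an assumption):

# sketch: concatenate the 50 per-job output files into a single file
with open("mpi_50k_all.out", "w") as merged:
    for i in range(1, 51):
        with open("mpi_50k{}.out".format(i)) as part:
            merged.write(part.read())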

Gilles Gouaillardet