I have an HPC cluster with SLURM installed. I can properly allocate nodes and cores for myself, and I would like to be able to use all the allocated cores regardless of which node they are in. As I read in this thread Using the multiprocessing module for cluster computing, this cannot be achieved with multiprocessing.
My script looks like this (oversimplified version):
import multiprocessing

def func(input_data):
    # lots of computing
    return data

parallel_pool = multiprocessing.Pool(processes=300)
returned_data_list = []
for i in parallel_pool.imap_unordered(func, lots_of_input_data):
    returned_data_list.append(i)

# Do additional computing with the returned_data
....
This script works perfectly fine; however, as I mentioned, multiprocessing is not a good tool for me: even if SLURM allocates 3 nodes for me, multiprocessing can only use one of them. As far as I understand, this is a limitation of multiprocessing.
I could use SLURM's srun command, but that just executes the same script N times, and I need to do additional computing with the output of the parallel processes. I could of course store the outputs somewhere and read them back in (a sketch of what I mean is below), but there must be a more elegant solution.
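To show what I mean by storing the outputs and reading them back in, this is roughly the workaround I would rather avoid (untested; the file names, placeholder data and the final gathering step are just illustrative):

# Rough sketch of the file-based workaround, launched as something like:
#   srun -n 300 python worker.py
import os
import pickle

def func(input_data):
    # lots of computing
    return input_data  # placeholder result

task_id = int(os.environ['SLURM_PROCID'])    # index of this srun task
num_tasks = int(os.environ['SLURM_NTASKS'])  # total number of tasks

lots_of_input_data = list(range(1000))       # placeholder input
my_chunk = lots_of_input_data[task_id::num_tasks]  # this task's slice

results = [func(x) for x in my_chunk]
with open('results_%d.pkl' % task_id, 'wb') as f:
    pickle.dump(results, f)

# A separate step would then have to wait for all tasks to finish, read
# every results_*.pkl back in, and do the additional computing there.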
In the mentioned thread there are suggestions like jug, but reading through it I haven't found a solution that fits my case.
Maybe mpi4py could be a solution for me? The tutorials for it seem very messy, and I haven't found a specific solution for my problem in there either (run a function in parallel with MPI, and then continue the script).
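If I understand the mpi4py.futures documentation correctly, the pattern would look roughly like this untested sketch (func and lots_of_input_data are the same placeholders as in my script above), but I am not sure this is the right direction:

# Untested sketch based on the mpi4py.futures docs. The idea would be to
# launch it with something like:
#   srun -n 300 python -m mpi4py.futures my_script.py
# so that rank 0 runs the code below and the other ranks act as workers.
from mpi4py.futures import MPIPoolExecutor

def func(input_data):
    # lots of computing
    return input_data  # placeholder result

if __name__ == '__main__':
    lots_of_input_data = list(range(1000))   # placeholder input
    returned_data_list = []
    with MPIPoolExecutor() as pool:
        # unordered=True should behave roughly like imap_unordered
        for result in pool.map(func, lots_of_input_data, unordered=True):
            returned_data_list.append(result)
    # Do additional computing with returned_data_list here, on rank 0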
I tried subprocess calls, but they seem to work the same way as multiprocessing calls, so they only run on one node. I haven't found any confirmation of this, so it is only a guess from trial and error.
How can I overcome this problem? Currently I could use more than 300 cores, but one node only has 32, so if I could find a solution I would be able to run my project nearly 10 times as fast.
Thanks