I have a cluster of 4 nodes with 64 CPU cores each. I installed SLURM, and it seems to be working: if I call sbatch, I get the proper allocation and queueing. However, if I request more than 64 cores (i.e., more than one node), SLURM allocates the correct number of nodes, but when I ssh into the allocated nodes I only see actual work on one of them. The rest just sit there doing nothing.
My code is complex, and it uses multiprocessing. I create pools with around 300 workers, so I don't think that should be the problem.
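A stripped-down sketch of how I use the pool (the real worker function is much more involved; work and n_items here are just placeholders):

from multiprocessing import Pool

def work(item):
    # placeholder for the real, much more complex computation
    return item * item

if __name__ == "__main__":
    n_items = 10000  # placeholder input size
    # roughly 300 workers, as mentioned above
    with Pool(processes=300) as pool:
        results = pool.map(work, range(n_items))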
What I would like to achieve is to call sbatch myscript.py with, say, 200 cores and have SLURM distribute my run across those 200 cores, rather than just allocating the correct number of nodes while actually using only one.
The header of my Python script looks like this:
#!/usr/bin/python3
#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200
and I call the script with sbatch myscript.py.
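In case it helps with diagnosing this, I can add something like the following at the top of the script to print what SLURM actually hands the job (these are the standard SLURM environment variables, as far as I know):

import os

# dump the SLURM-provided environment so the allocation can be inspected
for var in ("SLURM_JOB_NODELIST", "SLURM_JOB_NUM_NODES",
            "SLURM_NTASKS", "SLURM_CPUS_ON_NODE"):
    print(var, "=", os.environ.get(var))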