
I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with slurm.

If I run (on a login node)

mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"

then I get the expected 0 and 1 printed out.
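(For reference, the one-liner is equivalent to running a small script; the filename rank_test.py is just an arbitrary name for illustration, launched with mpirun -n 2 python rank_test.py.)

    # rank_test.py -- each MPI process prints its rank in MPI_COMM_WORLD.
    # On the login node this prints 0 and 1 (in some order), as expected.
    from mpi4py import MPI

    print(MPI.COMM_WORLD.Get_rank())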

If, however, I salloc --time=30 --nodes=1 and run the same mpirun command from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.

Then, if I change -n 2 to -n 3 (still on the compute node), I get a long error from slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other output), but I am not sure how to explain this either...

Now, based on this OpenMPI page, it seems these kinds of operations should be supported, at least for OpenMPI:

Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.

Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a slurm environment (?), but I am still wondering: how do mpirun and slurm (salloc) interact to produce this behavior? Why would it print two 0s in the first case, and what are the inconsistent task counts it complains about in the second case?
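In case it helps with diagnosis, here is a slightly extended check (a sketch; the name check_mpi.py is arbitrary) that also prints the world size, the host name, and the MPI library version, which should show whether the two processes really joined one MPI_COMM_WORLD or were each started as independent singleton jobs (each of which would report rank 0 of size 1):

    # check_mpi.py -- print rank, world size, host and MPI library version.
    # Two processes that each report "rank 0 of 1" were launched as separate
    # singleton MPI jobs rather than as one 2-process job.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    print("rank", comm.Get_rank(), "of", comm.Get_size(),
          "on", MPI.Get_processor_name())
    print(MPI.Get_library_version().splitlines()[0])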

  • Actually, a whole variety of errors can be produced... changing to `--nodes=2` in the `salloc` and running the same `mpirun` produces a `BAD TERMINATION` error from Intel MPI, using `mpiexec` instead of `mpirun` produces `srun: error: PMK_KVS_Barrier duplicate request from task 0`, and the list probably goes on. Am I just not understanding how `mpirun` / slurm should be used? – Grayscale Jul 12 '18 at 07:44
  • Note that Intel MPI is based on MPICH (and not Open MPI). – Gilles Gouaillardet Jul 12 '18 at 07:51
  • @GillesGouaillardet I have heard OpenMPI described as an "MPICH compatible library" though, so I would expect the behavior to mostly be similar? – Grayscale Jul 12 '18 at 07:57
  • Open MPI and MPICH both implement the same MPI standard. That only means the same code can be built without any changes with the library of your choice. That being said, there is **no** binary compatibility and you cannot mix `mpirun` from one implementation with the library of another. – Gilles Gouaillardet Jul 12 '18 at 08:03
  • @GillesGouaillardet Yeah this makes sense - this is why I said maybe Intel MPI just doesn't have the same support. I am still wondering though how to explain those outputs. – Grayscale Jul 12 '18 at 08:07

0 Answers