
I am trying to submit a job with Slurm. The job fails if I use srun or mpirun, but it runs fine with mpiexec, albeit with only a single process despite multiple nodes and multiple cores being allocated.

The actual command used is:

srun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default

Following is the error I get with srun/mpirun:

[mpiexec@n1581] match_arg (utils/args/args.c:163): unrecognized argument pmi_args
[mpiexec@n1581] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@n1581] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@n1581] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments

The code compiles fine, but I am facing these issues when running it through Slurm. Any help on this is appreciated.

Edit: Here is the output of `which mpirun`, `which mpiexec`, and `ldd` on the executable:

/nfs/apps/MPI/openmpi/3.1.3/gnu/6.5.0/cuda/9.0/bin/mpirun
/nfs/apps/ParaView/5.8/binary/bin/mpiexec
        linux-vdso.so.1 =>  (0x00007fff78255000)
        libmpi.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/release_mt/libmpi.so.12 (0x00002ae6cb57d000)
        libz.so.1 => /nfs/apps/Libraries/zlib/1.2.11/system/lib/libz.so.1 (0x00002ae6cbd4c000)
        libmpifort.so.12 => /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/intel64/lib/libmpifort.so.12 (0x00002ae6cbf67000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ae6cc315000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ae6cc519000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ae6cc721000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ae6cc93e000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ae6ccc40000)
        libgcc_s.so.1 => /nfs/apps/Compilers/GNU/6.5.0/lib64/libgcc_s.so.1 (0x00002ae6cd003000)
        /lib64/ld-linux-x86-64.so.2 (0x0000558ea723a000)

Here is my job script.

  • Is https://stackoverflow.com/questions/33780992/mpirun-unrecognized-argument-mca of any help? – Ian Bush Jun 21 '20 at 19:32
  • Not really, because there the user passes the `--mca` flag and gets `unrecognized argument mca` as the error. In my case, I don't pass `pmi_args` as an argument and don't know where it is coming from. – SKPS Jun 21 '20 at 21:31
  • 1
    in your batch script, add `which mpirun`, `which mpiexec` and `ldd /nfs/home/6/sanjeevis/dns/lb3d/src/lbe` to figure out which MPI library is used and by who. – Gilles Gouaillardet Jun 24 '20 at 09:41
  • @francescalus: Thanks. I have updated the job script and added a few more details that other users asked for. I am not aware of how to configure PMI. How can I get this information? – SKPS Jun 24 '20 at 12:34
  • @GillesGouaillardet: Updated the required info and jobscript. – SKPS Jun 24 '20 at 12:34
  • you are mixing three MPI implementations! `mpirun` is from Open MPI, mpiexec is likely the builtin MPICH from Paraview, and your app is built with Intel MPI. try using `/nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin/mpirun` (or `/nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin64/mpirun`) instead. – Gilles Gouaillardet Jun 24 '20 at 12:41
  • or if you want to use `srun`, you first need to `export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so` – Gilles Gouaillardet Jun 24 '20 at 12:46
  • @GillesGouaillardet: Thanks much! I pointed the MPI to a different version and it works fine now. I was using the default MPI, which was a different version. I have to figure out why the inconsistency came through. Since you have helped me with the finer details, I can give you the bounty if you would like. – SKPS Jun 24 '20 at 13:07
  • Thanks, I posted my comments as an answer. I am not familiar with the rules of the bounty, and if it is possible, I invite you to consider sharing the bounty with @denfromufa who was the first to give a hint about the root cause. – Gilles Gouaillardet Jun 25 '20 at 05:28
  • @GillesGouaillardet actually Ian Bush was the first to point out the problem, in the first comment. It is just that SKPS did not notice that point in the linked answer, and I noticed it after answering. – denfromufa Jun 25 '20 at 05:49
  • I did not click the link ... let's consider giving everyone his fair share then! – Gilles Gouaillardet Jun 26 '20 at 05:50
  • @denfromufa: I did go through the answer from Ian Bush and it does not help this case. Please see my response to Ian Bush's comment. – SKPS Jun 26 '20 at 13:29
  • @SKPS did you see this in Ian's link? "A typical case of a mix-up of multiple MPI implementations." – denfromufa Jun 26 '20 at 14:28
  • @denfromufa: Thanks! The answer matches this issue. However, I feel it is a bit more subtle here because, unlike there (`mca`), I am not passing any flag during execution, so I initially thought it was more complicated. The `pmi_args` error seems common but is not addressed in detail, so I feel this Q&A will help the forum. – SKPS Jun 26 '20 at 16:18

2 Answers


The most likely problem is that the program is compiled with one MPI implementation but launched with another. Make sure that all MPI-related environment variables are set consistently: OPAL_PREFIX, MPI_ROOT, PATH, and LD_LIBRARY_PATH.
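
As a quick diagnostic, something like the following could go near the top of the batch script (a sketch; it only prints information and reuses the executable path from the question):

    # Show which launchers are found first in PATH
    which mpirun mpiexec srun

    # Show which MPI library the executable is actually linked against
    ldd /nfs/home/6/sanjeevis/dns/lb3d/src/lbe | grep -i libmpi

    # Inspect the environment variables that select an MPI installation
    echo "OPAL_PREFIX=$OPAL_PREFIX"
    echo "MPI_ROOT=$MPI_ROOT"
    echo "PATH=$PATH"
    echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

If the launcher, the linked libmpi, and the paths in the environment come from different installations, you have found the mismatch.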

denfromufa

The root cause is a mix of several MPI implementations that do not interoperate:

  • mpirun is from Open MPI
  • mpiexec is likely the built-in MPICH from ParaView
  • your app is built with Intel MPI.

Try using /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin/mpirun (or /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin64/mpirun) instead so the launcher will match your MPI library.
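
For illustration, a minimal batch script along these lines could work (a sketch, not your actual script: the job name, node and task counts are placeholders, and bin may need to be bin64 on your system):

    #!/bin/bash
    #SBATCH --job-name=lbe
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16

    # Launcher from the same Intel MPI installation the executable was linked against
    IMPI=/nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210

    $IMPI/bin/mpirun -n $SLURM_NTASKS /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default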

If you want to use srun with Intel MPI, an extra step is required. You first need to

export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
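
A minimal sketch of the srun variant, assuming Slurm's PMI library is installed at /usr/lib64/libpmi.so (that is only a common default; the exact path depends on your site):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16

    # Point Intel MPI at Slurm's PMI library so srun can bootstrap the ranks
    # (assumed path; ask your admins where libpmi.so lives on this cluster)
    export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

    srun /nfs/home/6/sanjeevis/dns/lb3d/src/lbe -f input-default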
Gilles Gouaillardet