
I am currently developing a program written in C++ with the MPI+pthread paradigm.

I added some functionality to my program; however, I now get a bad termination message from one MPI process, like this:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 37805 RUNNING AT node165
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@node162] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:0@node162] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node166] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:2@node166] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node166] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node162: task 0: Exited with exit code 7
[proxy:0:0@node162] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
srun: error: node166: task 2: Exited with exit code 7
[mpiexec@node162] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node162] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node162] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@node162] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

My problem is that I have no idea why I get this kind of message, and thus how to correct it.

I use only a few basic MPI functions, and I make sure that no thread makes MPI calls (only my "master process" thread is allowed to call such functions).

I also checked that a process does not send a message to itself, and that the destination process exists before sending a message.
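For reference, here is a minimal sketch of the initialization pattern I described above (illustrative only, my real code is of course larger): MPI is initialized with MPI_THREAD_FUNNELED and only the main thread ever touches MPI.

```
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    // Ask for FUNNELED support: worker pthreads exist, but only the
    // main ("master") thread makes MPI calls.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        std::fprintf(stderr, "MPI_THREAD_FUNNELED not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // ... pthreads are spawned here for computation only; they never
    //     call MPI functions ...

    MPI_Finalize();
    return 0;
}
```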

My question is quite simple: how can I find out where the problem comes from, so that I can debug my application?

Thank you very much.

2 Answers


One of your processes has had a segmentation fault (the EXIT CODE: 11 in the message corresponds to signal 11, SIGSEGV). This means it read from or wrote to an area of memory that it is not permitted to access.

That's the cause. MPI functions are often difficult to get right the first time; for example, it could be MPI send and receive calls with incorrect sizes or buffer locations.
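As a purely hypothetical illustration (not your code): a receive whose count is larger than the buffer it was given will write past the end of that buffer and can segfault in exactly this way.

```
#include <mpi.h>
#include <vector>

// Hypothetical size mismatch: rank 1 tells MPI_Recv there is room for
// 1000 doubles, but the buffer only holds 100, so a full-size message
// from rank 0 is written past the end of the allocation.
void broken_exchange(int rank) {
    const int n = 1000;
    if (rank == 0) {
        std::vector<double> msg(n, 1.0);
        MPI_Send(msg.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        std::vector<double> buf(100);               // room for 100 only
        MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0,   // but count says 1000
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```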

The best solution is to fire up a parallel debugger so that you can watch all the processes. It looks like you are using a proper HPC system, so there is a chance one is already installed -- DDT or TotalView are the most popular.

Take a look at How to debug an MPI program

David
  • As a first step, you can force a core file to be generated with `ulimit -c unlimited; mpirun ...` and debug it post-mortem with `gdb` (i.e. no need for a parallel debugger). If the error does not make any sense at all, try `ulimit -s unlimited; mpirun ...` and see if it helps. – Gilles Gouaillardet Oct 18 '18 at 23:55
  • `ulimit` won't work that easily: the example spans multiple nodes (node162 is where mpirun runs, but node165 is where it crashed) and the ulimit would not follow through to node165. For most MPIs you can get mpirun to execute a script instead, which sets `ulimit -c` and then runs the application. Note: `ulimit -c` is not good practice for HPC, because at higher scales every process may crash at the same time and each generate gigabytes of core dump, all written to the filesystem at once. – David Oct 20 '18 at 10:33
  • On a well configured cluster, ulimits are propagated from `mpirun`. Obviously `ulimit -c unlimited` should not be the default. And unless you have an unlimited budget, (commercial) parallel debuggers do not scale either. – Gilles Gouaillardet Oct 20 '18 at 14:00

In my experience writing C++ with MPI, this problem frequently occurred when I did not call `MPI_Finalize();` before every return statement.
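A minimal sketch of what I mean (illustrative names, not the asker's code): call MPI_Finalize() on every exit path of main, not only at the end.

```
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    if (argc < 2) {                       // example early-exit path
        std::fprintf(stderr, "missing argument\n");
        MPI_Finalize();                   // finalize here too
        return 1;
    }

    // ... actual work ...

    MPI_Finalize();                       // and before the normal return
    return 0;
}
```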

Paka101