
I have implemented a sample MPI application with a Producer and a Consumer. The Producer runs on the process with rank 0, and a Consumer runs on every non-zero rank. Each Consumer spawns threads to process the messages generated by the Producer; these threads are split into a receiver thread and worker threads.

The receiver thread executes recv and, on receiving a message, passes it to a consumer worker, which performs the computation and sends the processed message back to the Producer (rank 0).
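Here's a minimal sketch of the message flow, not the actual application code: the `Message` type, payload, and tag are simplified placeholders, and the Poco receiver/worker threads on the consumer side are omitted.

```cpp
// Sketch only: rank 0 is the producer; every other rank is a consumer.
// In the real application the recv happens on a dedicated receiver thread
// and the processing/reply on worker threads; that threading is omitted here.
#include <boost/mpi.hpp>
#include <boost/serialization/string.hpp>
#include <string>

namespace mpi = boost::mpi;

struct Message {
    std::string payload;  // placeholder payload

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & payload;
    }
};

int main(int argc, char* argv[]) {
    mpi::environment env(argc, argv);
    mpi::communicator world;
    const int tag = 0;

    if (world.rank() == 0) {
        // Producer: send work to every consumer rank, then collect the replies.
        for (int dest = 1; dest < world.size(); ++dest) {
            Message m;
            m.payload = "work";
            world.send(dest, tag, m);
        }
        for (int src = 1; src < world.size(); ++src) {
            Message reply;
            world.recv(src, tag, reply);
        }
    } else {
        // Consumer: receive, "process", and send the result back to rank 0.
        Message m;
        world.recv(0, tag, m);
        m.payload += " processed";
        world.send(0, tag, m);
    }
    return 0;
}
```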

I am running this code on a dual-core machine. When I execute the application with `mpirun -np 2`, it performs just fine for any number of messages generated by the producer. When I run it with `mpirun -np 4`, the application crashes after a couple of runs.

Has somebody encountered this issue before? It would be great to get some insight into why this might be happening.

Edit: Here's the error I get every time I run my application:

*** glibc detected *** application: free(): invalid pointer: 0x00007f67d1f9f9e0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7e626)[0x7f67d0671626]
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x9041)[0x7f67cc790041]
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x5a00)[0x7f67cc78ca00]
/usr/lib/libmpi.so.0(MPI_Recv+0x154)[0x7f67d1d531e4]
/usr/local/lib/libboost_mpi.so.1.50.0(_ZN5boost3mpi6detail19packed_archive_recvEP19ompi_communicator_tiiRNS0_15packed_iarchiveER20ompi_status_public_t+0x33)[0x7f67d1fcb223]
/usr/local/lib/libboost_mpi.so.1.50.0(_ZNK5boost3mpi12communicator4recvINS0_15packed_iarchiveEEENS0_6statusEiiRT_+0x45)[0x7f67d1fc4755]
application(_ZNK5boost3mpi12communicator9recv_implI7MessageEENS0_6statusEiiRT_N4mpl_5bool_ILb0EEE+0x74)[0x464d98]
application(_ZNK5boost3mpi12communicator4recvI7MessageEENS0_6statusEiiRT_+0x3b)[0x46479b]
application(_ZN12WorkerReceiver3runEv+0xac)[0x46b1da]
/usr/local/lib/libPocoFoundation.so.12(_ZN4Poco10ThreadImpl13runnableEntryEPv+0x96)[0x7f67d26fcb16]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f67d09b7e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f67d06e54bd]

Thanks

sc_ray
  • Do you have a core to debug? If you haven't got a minimal example of what crashes, please post some code (at least the communication). – Michael Foukarakis Jul 28 '12 at 05:43
  • Thanks for looking at it. I have added the backtrace with my post. Hopefully that will help. – sc_ray Jul 28 '12 at 15:08
  • The error in `free` means heap corruption. Check that your message sizes are correct in both the MPI calls and in the code that processes them. – Greg Inozemtsev Jul 28 '12 at 21:02
  • @GregInozemtsev - Thanks. It's the same message object that is being used on the send and the recv. Is there some other way to check that the message sizes are okay? Also, is it possible for the message buffer on the receiver to overflow or run out of memory because of the deluge of messages sent from rank 0? – sc_ray Jul 28 '12 at 22:25
  • There shouldn't be a problem with overruns - I'd expect that messages will just stop getting delivered after a certain point - until enough matching receives are posted. Since you are using threads, are you initializing MPI with `MPI_Init_thread`? I think Boost::MPI will check if MPI has already been initialized, so just put that call before instantiating an `environment`. – Greg Inozemtsev Jul 29 '12 at 05:45
  • @GregInozemtsev - Thanks for the `MPI_Init_thread` tip. I put the following before the environment: `MPI_Init_thread(&argc,&argv,MPI_THREAD_MULTIPLE,&provided)` (see the sketch after these comments), but that didn't seem to do the trick. I am still getting abnormal termination, sometimes with the error in `free` and sometimes with just an abrupt termination. Is there something else I can look into? – sc_ray Jul 29 '12 at 20:45
  • Also, the backtrace always being the same (pointing to `MPI_Recv`) is rather suspicious. – sc_ray Jul 29 '12 at 20:48
  • @sc_ray It could still be a heap corruption elsewhere in the program. I would try the suggestions [here](http://stackoverflow.com/questions/1010106/how-to-debug-heap-corruption-errors). You could also try building a debugging version of OpenMPI to see exactly where the crash is. – Greg Inozemtsev Jul 29 '12 at 22:36
  • @GregInozemtsev Thanks for your responses. I will look into building a debug version of Open MPI to narrow down the issue. Although I haven't been able to solve my issue, if you could consolidate your comments into an answer, I might be able to accept it. – sc_ray Jul 30 '12 at 04:25
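
Edit 2: For reference, here's a sketch of the threaded initialization suggested in the comments. This is not the actual application code; it assumes Boost.MPI skips `MPI_Init` when MPI is already initialized (and therefore also skips finalization), which is why `MPI_Finalize` is called explicitly.

```cpp
// Request MPI_THREAD_MULTIPLE before Boost.MPI constructs its environment,
// and check the thread-support level that was actually granted.
#include <boost/mpi.hpp>
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        // Without full thread support, a receiver thread calling recv while
        // other threads make MPI calls is not safe.
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available, got level %d\n", provided);
        MPI_Finalize();
        return EXIT_FAILURE;
    }

    {
        boost::mpi::environment env(argc, argv);  // MPI is already initialized
        boost::mpi::communicator world;
        // ... start the receiver and worker threads, run the application ...
    }

    // Since MPI was initialized outside Boost.MPI, finalize it ourselves.
    MPI_Finalize();
    return 0;
}
```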
