OpenMPI has the MCA parameter mpi_abort_print_stack for printing a backtrace after an MPI_ABORT, but the backtrace is quite limited in detail. For example, if I compile (mpicxx broadcast.cxx) and run (mpiexec --mca mpi_abort_print_stack 1 -n 4 ./a.out) this simple example code:
#include <mpi.h>

const int N = 10;
int arr[N];

int main(int argc, char** argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Correct: every rank passes the same count.
    MPI_Bcast(arr, N, MPI_INT, 0, MPI_COMM_WORLD);
    // Incorrect: rank 1 passes a larger count than the root.
    const int my_size = (rank == 1) ? N+1 : N;
    MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
I get the following output and back trace:
[manjaro:2223406] *** An error occurred in MPI_Bcast
[manjaro:2223406] *** reported by process [1082195969,3]
[manjaro:2223406] *** on communicator MPI_COMM_WORLD
[manjaro:2223406] *** MPI_ERR_TRUNCATE: message truncated
[manjaro:2223406] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[manjaro:2223406] *** and potentially your MPI job)
[manjaro:2223406] [0] func:/usr/lib/libopen-pal.so.40(opal_backtrace_buffer+0x3b) [0x7f3681231a9b]
[manjaro:2223406] [1] func:/usr/lib/libmpi.so.40(ompi_mpi_abort+0x160) [0x7f368183f040]
[manjaro:2223406] [2] func:/usr/lib/libmpi.so.40(ompi_mpi_errors_are_fatal_comm_handler+0xb9) [0x7f36818369b9]
[manjaro:2223406] [3] func:/usr/lib/libmpi.so.40(ompi_errhandler_invoke+0xd3) [0x7f3681836ab3]
[manjaro:2223406] [4] func:/usr/lib/libmpi.so.40(PMPI_Bcast+0x455) [0x7f36818514b5]
[manjaro:2223406] [5] func:./a.out(+0x7c48) [0x560d4d420c48]
[manjaro:2223406] [6] func:/usr/lib/libc.so.6(+0x29290) [0x7f36812d2290]
[manjaro:2223406] [7] func:/usr/lib/libc.so.6(__libc_start_main+0x8a) [0x7f36812d234a]
[manjaro:2223406] [8] func:./a.out(+0x7ac5) [0x560d4d420ac5]
So the backtrace tells me that there is a problem in some MPI_Bcast call, but not exactly which one.
Is it possible to get a more detailed backtrace, including e.g. line numbers?