
I ran an `mpiexec` command and it crashed with a SIGSEGV (segmentation fault). My question is not about why the error occurred, but about how the error is displayed.

Here is the error output:

[songyi719-thinkpad-x1-extreme-2nd:172415] *** Process received signal ***
[songyi719-thinkpad-x1-extreme-2nd:172415] Signal: Segmentation fault (11)
[songyi719-thinkpad-x1-extreme-2nd:172415] Signal code: Address not mapped (1)
[songyi719-thinkpad-x1-extreme-2nd:172415] Failing at address: 0x440000e8
[songyi719-thinkpad-x1-extreme-2nd:172412] *** Process received signal ***
[songyi719-thinkpad-x1-extreme-2nd:172412] Signal: Segmentation fault (11)
[songyi719-thinkpad-x1-extreme-2nd:172412] Signal code: Address not mapped (1)
[songyi719-thinkpad-x1-extreme-2nd:172412] Failing at address: 0x440000e8
[songyi719-thinkpad-x1-extreme-2nd:172413] *** Process received signal ***
[songyi719-thinkpad-x1-extreme-2nd:172413] Signal: Segmentation fault (11)
[songyi719-thinkpad-x1-extreme-2nd:172413] Signal code: Address not mapped (1)
[songyi719-thinkpad-x1-extreme-2nd:172413] Failing at address: 0x440000e8
[songyi719-thinkpad-x1-extreme-2nd:172415] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f0c3a59e3c0]
[songyi719-thinkpad-x1-extreme-2nd:172415] [ 1] /usr/local/lib/libmpi.so.40(MPI_Comm_rank+0x3b)[0x7f0c3a78771b]
[songyi719-thinkpad-x1-extreme-2nd:172415] [ 2] ./data(+0x3a432)[0x562c1fab5432]
[songyi719-thinkpad-x1-extreme-2nd:172415] [ 3] ./data(+0x98d9)[0x562c1fa848d9]
[songyi719-thinkpad-x1-extreme-2nd:172415] [ 4] [songyi719-thinkpad-x1-extreme-2nd:172413] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fe5dd1ec3c0]
[songyi719-thinkpad-x1-extreme-2nd:172413] [ 1] /usr/local/lib/libmpi.so.40(MPI_Comm_rank+0x3b)[0x7fe5dd3d571b]
[songyi719-thinkpad-x1-extreme-2nd:172413] [songyi719-thinkpad-x1-extreme-2nd:172412] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f021418a3c0]
[songyi719-thinkpad-x1-extreme-2nd:172412] [ 1] /usr/local/lib/libmpi.so.40(MPI_Comm_rank+0x3b)[0x7f021437371b]
[songyi719-thinkpad-x1-extreme-2nd:172412] [ 2] [songyi719-thinkpad-x1-extreme-2nd:172414] *** Process received signal ***
[songyi719-thinkpad-x1-extreme-2nd:172414] Signal: Segmentation fault (11)
[songyi719-thinkpad-x1-extreme-2nd:172414] Signal code: Address not mapped (1)
[songyi719-thinkpad-x1-extreme-2nd:172414] Failing at address: 0x440000e8
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f0c3a3be0b3]
[songyi719-thinkpad-x1-extreme-2nd:172415] [ 5] ./data(+0xa33e)[0x562c1fa8533e]
[songyi719-thinkpad-x1-extreme-2nd:172415] *** End of error message ***
[songyi719-thinkpad-x1-extreme-2nd:172414] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fc68e9043c0]
[songyi719-thinkpad-x1-extreme-2nd:172414] [ 1] /usr/local/lib/libmpi.so.40(MPI_Comm_rank+0x3b)[0x7fc68eaed71b]
[songyi719-thinkpad-x1-extreme-2nd:172414] [ 2] ./data(+0x3a432)[0x55e7f5786432]
[songyi719-thinkpad-x1-extreme-2nd:172414] [ 3] ./data(+0x98d9)[0x55e7f57558d9]
[songyi719-thinkpad-x1-extreme-2nd:172414] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fc68e7240b3]
[songyi719-thinkpad-x1-extreme-2nd:172414] [ 5] ./data(+0xa33e)[0x55e7f575633e]
[songyi719-thinkpad-x1-extreme-2nd:172414] *** End of error message ***
[ 2] ./data(+0x3a432)[0x560705a04432]
[songyi719-thinkpad-x1-extreme-2nd:172413] [ 3] ./data(+0x98d9)[0x5607059d38d9]
[songyi719-thinkpad-x1-extreme-2nd:172413] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fe5dd00c0b3]
[songyi719-thinkpad-x1-extreme-2nd:172413] [ 5] ./data(+0xa33e)[0x5607059d433e]
[songyi719-thinkpad-x1-extreme-2nd:172413] *** End of error message ***
./data(+0x3a432)[0x559eacf7a432]
[songyi719-thinkpad-x1-extreme-2nd:172412] [ 3] ./data(+0x98d9)[0x559eacf498d9]
[songyi719-thinkpad-x1-extreme-2nd:172412] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f0213faa0b3]
[songyi719-thinkpad-x1-extreme-2nd:172412] [ 5] ./data(+0xa33e)[0x559eacf4a33e]
[songyi719-thinkpad-x1-extreme-2nd:172412] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 0 on node songyi719-thinkpad-x1-extreme-2nd exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

As you can see, the same error message is interleaved and repeated 4 times, once per process. I uninstalled and re-installed Open MPI, but the error is still repeated 4 times.

Why does this happen? How can I reduce this output to a single, non-repeated error message?

songyi719
  • Open MPI has a built-in mechanism to aggregate `MPI_Abort()`, but I am afraid there is nothing to aggregate stack traces from multiple crashes. – Gilles Gouaillardet Feb 03 '21 at 00:32
  • To tack on to the above comment, what you *can* do is write your own signal handler for `SIGSEGV` that calls `MPI_Abort()` and hope all ranks reach it before they are killed by the OS. But this is quite fragile, since `SIGSEGV` is sticky (i.e. even if you do "handle" it, it will keep getting re-raised after your handler returns), and so it is not really recommended. This [SO post](https://stackoverflow.com/questions/10202941/segmentation-fault-handling) is fairly informative. – Jacob Faib Feb 03 '21 at 02:12
  • Aggregation of distributed stack traces is too hard to be done in the error-handling part of an MPI implementation. There are special tools for that, such as [STAT](https://github.com/LLNL/STAT). – Hristo Iliev Feb 03 '21 at 09:55
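
Since the comments indicate that Open MPI does not aggregate crash reports itself, one possible workaround is to post-process the captured output. The following is only a minimal sketch, not an Open MPI feature: it assumes each report line carries a `[hostname:pid]` tag as in the output above, and the names `group_by_pid` and `summarize` are made up for this illustration. Fragments that lost their tag because of interleaving cannot be attributed to a process and are skipped.

```python
import re
from collections import Counter, defaultdict

# Matches the "[hostname:pid]" tag seen in the output above.
# Frame markers like "[ 0]" and addresses like "[0x7f...]" do not match,
# because they contain no "name:digits" pair.
TAG = re.compile(r"\[([^\[\]\s:]+):(\d+)\]")


def group_by_pid(log_text):
    """Split possibly interleaved output into per-PID lists of fragments."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        matches = list(TAG.finditer(line))
        for i, m in enumerate(matches):
            # A fragment runs from the end of one tag to the start of the next
            # tag on the same line, or to the end of the line.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
            fragment = line[m.end():end].strip()
            if fragment:
                groups[int(m.group(2))].append(fragment)
        # Lines with no tag at all are skipped: their owner is unknown.
    return dict(groups)


def summarize(groups):
    """Count how many processes reported each distinct 'Signal: ...' line."""
    counts = Counter()
    for fragments in groups.values():
        for fragment in fragments:
            if fragment.startswith("Signal: "):
                counts[fragment] += 1
    return counts
```

Applied to the captured stderr of a run (for example, text read from a hypothetical `crash.log` file), `summarize(group_by_pid(text))` collapses the repeated reports into one entry per distinct signal with a process count, while `group_by_pid(text)` alone gives the de-interleaved fragments per process. It may also be worth checking the `mpiexec` man page of the installed version for per-process output options such as `--output-filename`, which could avoid the interleaving in the first place.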
