
I have developed a distributed solver for a particular application using MPI. I need to perform a time study by running the solver for different problem sizes and different numbers of processors to see what kind of speedup it achieves over a serial solver.

The code works correctly for every combination of problem size and number of processors except one. In this one combination (32 processors, the highest count, with the second-largest problem), I get a segmentation fault. Note that the code works for this problem size with a different number of processors, and it also works with 32 processors on the other problem sizes.

Compiling the C++ code produces an executable named m75, which is executed like this:

         mpirun -n 32 m75

Here's the output error file:

https://pastebin.com/pfN7b8CB

I am not sure what could be wrong with my code, given that it works for all other problem instances. How do I debug this? I am running the code on a cluster with the Slurm job scheduler.

Edit:

As recommended, I generated a core dump and examined it with gdb. I discovered that the segmentation fault happens at this line:

free(tempr2[temp]);

I did this:

print tempr2[temp]
(double *) 0x31

What does this mean? How do I proceed with this?
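
In case it helps, the gdb steps were roughly the following (the core file name and the frame number are placeholders; they will differ from run to run):

    $ gdb ./m75 core.<pid>       # core file name depends on the system
    (gdb) bt                     # backtrace shows the crash inside free()
    (gdb) frame 3                # select the frame that calls free()
    (gdb) print tempr2[temp]
    $1 = (double *) 0x31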

  • The best option is to use a parallel debugger capable of debugging MPI programs, such as DDT or TotalView. Unfortunately, these are usually not free. Does your cluster have any documentation? Some debuggers may be mentioned there. Alternatively, you can attach an "ordinary" debugger such as GDB to your running processes, but this can be a bit tedious. – Daniel Langr Sep 05 '20 at 13:17
  • A parallel debugger such as DDT (commercial) is very helpful. Otherwise, build your app with `-g`, have it generate a core dump, and debug it post mortem. – Gilles Gouaillardet Sep 05 '20 at 13:19
  • @DanielLangr Is there anything I can attach to my code that will simply tell me which line is causing the segmentation fault? I don't know if I can manually debug this because the error occurs hours after the code has started running. – Sanit Sep 05 '20 at 13:19
  • That can be a bit of a problem. Debugging works best with optimizations disabled and debugging info enabled; however, disabling optimization can make your program run much longer. As Gilles also suggested, it's a good idea to enable core dumps and open the dump in a debugger after the crash. Note that a core dump may require a lot of storage space in HPC applications. – Daniel Langr Sep 05 '20 at 13:21
  • @GillesGouaillardet I don't have access to a commercial debugger. To build it with -g, do I simply compile it with the -g flag and then run the executable with mpirun? – Sanit Sep 05 '20 at 13:22
  • @Sanit As we wrote, you may need to [enable core dumps](https://stackoverflow.com/questions/17965/how-to-generate-a-core-dump-in-linux-on-a-segmentation-fault) if they are not enabled by default on your system (a build-and-run sketch follows these comments). – Daniel Langr Sep 05 '20 at 13:25
  • If you're using GCC or Clang to compile your code, try enabling AddressSanitizer when you compile. It will give you much more useful debug information when a segfault happens and make it much easier to determine the cause of the problem. – Aziz Sep 05 '20 at 16:20
  • @DanielLangr Thanks! Running my code again after enabling core dumps. – Sanit Sep 06 '20 at 06:27
  • @Aziz How would I go about enabling AddressSanitizer for mpicxx (which is a wrapper compiler around g++)? – Sanit Sep 06 '20 at 06:30
  • @Sanit Just pass `-fsanitize=address` with the other compiler flags. – Aziz Sep 06 '20 at 13:31
  • @Aziz When I tried to compile on the cluster using this flag, here's what I got: `cannot find /usr/lib64/libasan.so.0.0.0` – Sanit Sep 06 '20 at 15:30
  • I wrote something about debugging Open MPI [here](https://stackoverflow.com/a/62180459/1374437). You may find it helpful. – Hristo Iliev Sep 06 '20 at 17:23
  • @Sanit Which operating system are you using, and which version of gcc? Try it with `-lasan`. If that doesn't work, you may need to install `libasan`. – Aziz Sep 06 '20 at 18:51
  • @DanielLangr I generated the core dump and ran gdb on it. I have added the results to the question. – Sanit Sep 08 '20 at 11:43
  • There is likely some memory corruption in your program. For instance, when you call `free(tempr2[temp]);`, `tempr2[temp]` might not be a pointer allocated with `malloc` or its relatives. Or some buffer overflow may be happening in heap data. Have you also tried the _address sanitizer_, as recommended? There are also memory debuggers such as Valgrind, but they have huge runtime overhead due to virtualization, so they may not be usable for many-hour MPI runs. (A minimal illustration of this kind of corruption is sketched after these comments.) – Daniel Langr Sep 09 '20 at 07:42
  • Thank you so much, everyone. After storing the core dump and finding the line at which the segmentation fault was happening, I was able to find the cause of the error. It is fixed and the code is working now. – Sanit Sep 11 '20 at 12:26
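
A minimal sketch of the build-and-debug recipe the comments converge on, assuming a bash shell, that `mpicxx` wraps g++, and a placeholder source file name `solver.cpp`:

    # rebuild with debug info and, ideally, without optimization
    mpicxx -g -O0 -o m75 solver.cpp

    # allow core dumps inside the batch job, then run as before
    # (under Slurm the limit may also need to be propagated to the compute nodes)
    ulimit -c unlimited
    mpirun -n 32 m75

    # after the crash, open the core file (its name varies by system) in gdb
    gdb ./m75 core.<pid>

    # alternative: build with AddressSanitizer (requires libasan on the nodes)
    mpicxx -g -fsanitize=address -o m75 solver.cpp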

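To illustrate the memory-corruption scenario from the comments: `0x31` is not a plausible heap address (it is, for instance, the ASCII code of the character '1'), so `free()` typically crashes while reading the allocator metadata it expects just before that address. The following self-contained C++ sketch (invented names, not the asker's code) shows how a heap buffer overflow can corrupt memory that is only touched again much later, at a `free()` call far from the actual bug:

    #include <cstdlib>

    int main() {
        // An array of row pointers, analogous to tempr2 in the question.
        double** tempr2 = static_cast<double**>(std::malloc(4 * sizeof(double*)));
        for (int i = 0; i < 4; ++i)
            tempr2[i] = static_cast<double*>(std::malloc(8 * sizeof(double)));

        // Bug: writing 10 doubles into a buffer sized for 8. Depending on the
        // heap layout, the two extra writes can clobber allocator metadata or
        // a neighbouring allocation, leaving garbage in data used later.
        for (int j = 0; j < 10; ++j)
            tempr2[0][j] = 1.0;

        // The crash surfaces only here, far from the actual bug: free() is
        // handed a value that no longer points to the start of an allocation.
        for (int i = 0; i < 4; ++i)
            std::free(tempr2[i]);
        std::free(tempr2);
        return 0;
    }

Built with `-fsanitize=address`, the overflowing write itself is reported immediately, which is why the comments recommend AddressSanitizer over waiting for `free()` to trip over the damage.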
0 Answers