I have developed a distributed solver for a particular application using MPI. I need to perform a timing study by running the solver on different problem sizes with different numbers of processors, to see what kind of speedup it achieves over a serial solver.
The code works correctly for every combination of problem size and processor count except one. For that one combination (32 processors, the highest count, with the second-largest problem), I get a segmentation fault. Note that the code works for this problem size if I use any other number of processors, and it works with 32 processors on every other problem size.
Compiling the C++ code produces an executable named m75, which I run like this:
mpirun -n 32 m75
Here's the error output file from the run:
I am not sure what could be wrong with my code, given that it works for all other problem instances. How do I debug this? I am running the code on a cluster under the Slurm job scheduler.
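For completeness, I submit the job through Slurm with a batch script along these lines (the job name and time limit here are placeholders, not my actual values):

#!/bin/bash
#SBATCH --job-name=m75_timing   # placeholder job name
#SBATCH --ntasks=32             # one MPI rank per processor
#SBATCH --time=01:00:00         # placeholder time limit
mpirun -n 32 m75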
Edit:
As recommended, I generated a core dump and examined it with gdb. I discovered that the segmentation fault happens at this line:
free(tempr2[temp]);
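For reference, this is roughly how I produced and loaded the core dump (the core file name below is hypothetical; I used the one my cluster actually wrote):

ulimit -c unlimited    # in the batch script, before mpirun, so core files get written
mpirun -n 32 m75
gdb m75 core.12345     # hypothetical core file name
(gdb) bt               # the backtrace ends at the free() call shown above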
In gdb, I then printed the pointer being freed:
print tempr2[temp]
(double *) 0x31
What does this mean, and how should I proceed from here?
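For context, tempr2 is a dynamically allocated array of pointers that I free element by element. The real code is larger, but the allocation/deallocation pattern is roughly the following sketch (n and m here stand for problem-dependent sizes, not my actual variable names):

/* simplified sketch of the relevant allocation and cleanup */
double **tempr2 = (double **) malloc(n * sizeof(double *));
for (int i = 0; i < n; i++)
    tempr2[i] = (double *) malloc(m * sizeof(double));
/* ... computation that reads and writes tempr2 ... */
for (int temp = 0; temp < n; temp++)
    free(tempr2[temp]);   /* this is the line that segfaults */
free(tempr2);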