I'm trying to debug a Fortran MPI program. When I try to run it with 5 processes, I get a segmentation fault. Oddly enough, if I run the same program with fewer processes this doesn't happen.

When running the program with Valgrind (Memcheck) and analyzing the resulting core files (there are 3 of them) with GDB, I get the following output:

Core was generated by `'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000692a186 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x000000000692a186 in poll () from /lib64/libc.so.6
#1  0x000000000c47763b in btl_openib_async_thread () from /usr/mpi/intel/openmpi-1.4.3/lib/openmpi/mca_btl_openib.so
#2  0x000000000664a73d in start_thread () from /lib64/libpthread.so.0
#3  0x0000000006932f6d in clone () from /lib64/libc.so.6

And when I run the same program without Valgrind, the core files (there are now 4) show this (with different values for itistep in each core file):

Core was generated by `/home/me/myprogram.out /home/me/run'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b04b3fa70f0 in ?? ()
(gdb) bt
#0  0x00002b04b3fa70f0 in ?? ()
#1  <signal handler called>
#2  0x00002b04aea65d4f in opal_memory_ptmalloc2_int_free () from /usr/mpi/intel/openmpi-1.4.3/lib/libopen-pal.so.0
#3  0x00002b04aea6a420 in opal_memory_ptmalloc2_free_hook () from /usr/mpi/intel/openmpi-1.4.3/lib/libopen-pal.so.0
#4  0x00002b04af5477f1 in free () from /lib64/libc.so.6
#5  0x00000000005afecc in for_dealloc_allocatable ()
#6  0x00000000005734a2 in mtd () at mtdk.for:533
#7  0x00000000004ecd76 in opt (itistep=1088970925, inamein=Cannot access memory at address 0x146e0
) at opt.for:494
#8  0x000000000042a074 in cali () at cali.f90:379
#9  0x00000000004183cc in main ()

The line pointed to at #6 (mtdk.for:533) looks like this:

if (allocated(done2d))DEALLOCATE (done2D) 

In this program, done2d is a 2-dimensional, allocatable real array that gets allocated in the same subroutine. I don't see anything wrong between the allocate and the deallocate statements. I recompiled my program after adding stat= to my deallocate statement, as someone suggested here, but I'm getting the same output.
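For reference, a minimal sketch of what checking the status of both statements can look like. Only the array name done2d comes from the program above; the dimensions n1 and n2, the variable ierr and the program name are made up for illustration. Note that a nonzero stat only reports errors the run-time detects itself, so it would probably not catch a heap that was corrupted elsewhere.

```fortran
program check_alloc_status
  implicit none
  ! done2d mirrors the array name in the question; everything else is hypothetical.
  real, allocatable :: done2d(:,:)
  integer, parameter :: n1 = 100, n2 = 50   ! made-up dimensions
  integer :: ierr

  ! With stat= present, a failed allocation sets ierr instead of aborting.
  allocate(done2d(n1, n2), stat=ierr)
  if (ierr /= 0) then
    print *, 'allocate failed, stat = ', ierr
    stop 1
  end if

  done2d = 0.0

  ! Same idea for the deallocate: report the status instead of ignoring it.
  if (allocated(done2d)) then
    deallocate(done2d, stat=ierr)
    if (ierr /= 0) print *, 'deallocate failed, stat = ', ierr
  end if
end program check_alloc_status
```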

I'm using Intel Fortran 11.1 with the following flags: -O3 -C -pg -traceback -g -warn interfaces and running my program on CentOS.

Typing ulimit or ulimit -s on the command line returns unlimited.

I don't know where to look next. Does anybody know how to use this information to get to the root of the problem?

m.chips
  • That seems a bit odd. Did you check the result of the original `allocate(done2d)` statement? Also, bear in mind that the compiler/run-time will take care of deallocation of local arrays upon return from the subroutine. If you don't need the space immediately, see what happens if you let the processor take care of deallocation. – High Performance Mark Jun 25 '14 at 16:29
  • Seg faults in deallocs are, in my experience, usually due to you trashing memory elsewhere in the program and so destroying the information in the various tables needed by the deallocate statement. The very first thing I would do is turn on array bounds checking with the compiler and try again, though the fact that it is an MPI program means the problem might be more complicated than a simple wrong index. For instance, it could be that a receive buffer is not large enough for the message. But bounds checking is definitely where I would start. – Ian Bush Jun 25 '14 at 17:08
  • If I take out the `deallocate` statement, I get another segfault that takes me to another line in the same subroutine. Next, I tried to narrow down the problem by adding a `print` before that line, which caused the program to... run normally. I'm not sure what to make of this, but it seems that the `print` slows down the program enough to, somehow, let it avoid the situation that leads to the segfault... – m.chips Jun 25 '14 at 17:20
  • Add bounds checking to the compiler line! These are classic symptoms of a bad memory access. If it is to do with array indices, the compiler can pick it up for you automatically if you put the right flags on. – Ian Bush Jun 25 '14 at 17:25
  • @IanBush As I understood it, Ifort's `-C` flag covers all possible check options, so shouldn't it catch this kind of error? Or do you have another flag in mind? – m.chips Jun 25 '14 at 17:26
  • Not all kinds, sometimes the compiler doesn't know the array size. Or you can declare it wrong. – Vladimir F Героям слава Jun 25 '14 at 17:45
  • You should compile your MPI library with the same checking/debugging options, then you'll get somewhere. – steabert Jun 25 '14 at 19:22
  • Allocatable arrays are automatically deallocated when a function exits. If you put allocate and deallocate statements in different subroutines you are asking for trouble. The cure (though not perfect) is to use pointers instead of allocatable arrays. – Sturla Molden Jun 25 '14 at 22:51
  • Can you reproduce the error with a different number of nodes? With a different compiler? – user1824346 Jul 02 '14 at 09:33
  • Sorry for not replying for a while - the problem eventually disappeared, so I was able to continue working, but I still have no idea what caused it. So, I guess it's either a problem that only appears under certain conditions, or a problem with the cluster I'm working on rather than a problem with my code (as other users have had similar problems in the last few days). – m.chips Jul 18 '14 at 14:10
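
One possibility raised in the comments is a receive buffer that is too small for the incoming message. A minimal sketch of one way to rule that out is to probe the message first and size the buffer from the reported element count. The ranks, the tag, the message length and all variable names below are hypothetical and not taken from the program in the question; the sketch assumes at least 2 processes and an MPI library that provides the `mpi` module.

```fortran
program probe_then_recv
  use mpi
  implicit none
  integer, parameter :: tag = 0          ! hypothetical message tag
  integer :: ierr, rank, nprocs, nrecv
  integer :: status(MPI_STATUS_SIZE)
  real, allocatable :: buf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (nprocs >= 2) then
    if (rank == 1) then
      ! Sender: the length 10 is made up for the example.
      allocate(buf(10))
      buf = 1.0
      call MPI_Send(buf, size(buf), MPI_REAL, 0, tag, MPI_COMM_WORLD, ierr)
    else if (rank == 0) then
      ! Receiver: ask how large the pending message is before receiving it.
      call MPI_Probe(1, tag, MPI_COMM_WORLD, status, ierr)
      call MPI_Get_count(status, MPI_REAL, nrecv, ierr)
      ! Size the buffer from the actual message instead of a guess.
      allocate(buf(nrecv))
      call MPI_Recv(buf, nrecv, MPI_REAL, 1, tag, MPI_COMM_WORLD, status, ierr)
      print *, 'received ', nrecv, ' reals'
    end if
  end if

  call MPI_Finalize(ierr)
end program probe_then_recv
```

If the buffer size is fixed elsewhere in the code, comparing it against the count returned by `MPI_Get_count` before the receive is a cheaper variant of the same check.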
