
I've written a large program and I'm having a really hard time tracking down a segmentation fault. I posted a question about it, but it didn't have enough information to go on (see the link below; if you follow it, note that I spent almost an entire day trying, several times, to produce a minimal compilable version of the code that reproduces the error, to no avail).

https://stackoverflow.com/questions/16025411/phantom-bug-involving-stdvectors-mpi-c

So now I'm trying my hand at valgrind for the first time. I just installed it (simply "sudo apt-get install valgrind"), with no MPI-specific installation steps (if there are any). I'm hoping for concrete information, including file names and line numbers (I understand it's impossible for valgrind to provide variable names). While I am getting useful information, including

  • Invalid read of size 4
  • Conditional jump or move depends on uninitialised value(s)
  • Uninitialised value was created by a stack allocation
  • 4 bytes in 1 blocks are definitely lost

in addition to this magical thing

  • Syscall param sched_setaffinity(mask) points to unaddressable byte(s) at 0x433CE77: syscall (syscall.S:31) Address 0x0 is not stack'd, malloc'd or (recently) free'd

I am not getting file names and line numbers. Instead, I get

==15095==    by 0x406909A: ??? (in /usr/lib/openmpi/lib/libopen-rte.so.0.0.0)

Here's how I compile my code:

mpic++ -Wall -Wextra -g -O0 -o Hybrid.out (…file names)

Here are two ways I've executed valgrind:

valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log.txt mpirun -np 1 Hybrid.out

and

mpirun -np 1 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out

The second version based on instructions in

Segmentation faults occur when I run a parallel program with Open MPI

which, if I'm understanding the chosen answer correctly, appears to be contradicted by

openmpi with valgrind (can I compile with MPI in Ubuntu distro?)

I am deliberately running valgrind on one processor because that's the only way my program will execute to completion without the segmentation fault. I have also run it with two processors, and my program seg faulted as expected, but the log I got back from valgrind seemed to contain essentially the same information. I'm hoping that by resolving the issues valgrind reports on one processor, I'll magically solve the issue happening on more than one.

I tried to include "-static" in the program compilation as suggested in

Valgrind not showing line numbers in spite of -g flag (on Ubuntu 11.10/VirtualBox)

but the compilation failed, saying (in addition to several warnings)

dynamic STT_GNU_IFUNC symbol "strcmp" with pointer equality in '…' can not be used when making an executable; recompile with -fPIE and relink with -pie

I have not looked into what "-fPIE" and "-pie" mean. Also, please note that I am not using a makefile, nor do I currently know how to write one.

A few more notes: My code does not use the commands malloc, calloc, or new. I'm working entirely with std::vector; no C arrays. I do use commands like .resize(), .insert(), .erase(), and .pop_back(). My code also passes vectors to functions by reference and constant reference. As for parallel commands, I only use MPI_Barrier(), MPI_Bcast(), and MPI_Allgatherv().
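In case the usage pattern matters, here is a minimal, self-contained sketch of the kind of MPI_Bcast() / std::vector usage I'm describing. This is not my actual code; the file name, variable names, and values are invented for illustration, and the non-root ranks resize the vector before the broadcast so the receive buffer is large enough.

// bcast_sketch.cpp -- illustrative sketch only, not the real program
#include <mpi.h>
#include <vector>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // The root rank fills the vector; the other ranks start empty.
    std::vector<double> data;
    int count = 0;
    if (rank == 0) {
        for (int i = 0; i < 4; ++i)
            data.push_back(i + 1.0);
        count = static_cast<int>(data.size());
    }

    // Broadcast the element count first so every rank knows it.
    MPI_Bcast(&count, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Non-root ranks must resize BEFORE receiving; otherwise the
    // broadcast writes past the empty vector's storage, which is the
    // kind of error that only shows up on ranks other than 0.
    if (rank != 0)
        data.resize(count);

    MPI_Bcast(&data[0], count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    std::cout << "rank " << rank << " received " << data.size()
              << " values" << std::endl;

    MPI_Finalize();
    return 0;
}

It compiles with the same kind of command I use above (mpic++ -Wall -Wextra -g -O0) and runs under mpirun.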

How do I get valgrind to show the file names and line numbers for the errors it is reporting? Thank you all for your help!

EDIT

I continued working on it, and a friend pointed out that the reports without line numbers all come from the MPI libraries, which I did not compile from source; since I didn't build them with -g, there are no line numbers for valgrind to show there. So I tried valgrind again, based on this command,

mpirun -np 1 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out

but now for two processors, which is

mpirun -np 2 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out

The program ran to completion (I did not see the seg fault reported on the command line), but this run of valgrind did give me line numbers within my files. The line valgrind points to is a line where I call MPI_Bcast(). Is it safe to say that this appeared because the memory problem only manifests itself on multiple processors (since I've run it successfully with -np 1)?


1 Answer


It sounds like you are using the wrong tool. If you want to know where a segmentation fault occurs, use gdb.

Here's a simple example. This program will segfault at *b = 5:

// main.c

int
main(int argc, char** argv)
{
   int* b = 0;   // b is a null pointer
   *b = 5;       // writing through the null pointer raises SIGSEGV
   return *b;
}

To see what happened, run it under gdb (the <---- annotations mark the lines you type):

svengali ~ % g++ -g -c main.c -o main.o # include debugging symbols in .o file
svengali ~ % g++ main.o -o a.out        # executable is linked (no -g here)
svengali ~ % gdb a.out
GNU gdb (GDB) 7.4.1-debian
<SNIP>
Reading symbols from ~/a.out...done.
(gdb) run <--------------------------------------- RUNS THE PROGRAM
Starting program: ~/a.out 

Program received signal SIGSEGV, Segmentation fault.
0x00000000004005a3 in main (argc=1, argv=0x7fffffffe2d8) at main.c:5
5      *b = 5;
(gdb) bt  <--------------------------------------- PRINTS A BACKTRACE
#0  0x00000000004005a3 in main (argc=1, argv=0x7fffffffe2d8) at main.c:5
(gdb) print b <----------------------------------- EXAMINE THE CONTENTS OF 'b'
$2 = (int *) 0x0
(gdb) 
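If the crash only shows up under mpirun, you can still use gdb. One common approach (assuming you have X and xterm available; the executable name and process count below just mirror your setup) is to start each MPI rank inside its own debugger window:

mpirun -np 2 xterm -e gdb ./Hybrid.out

Then type run in each window, wait for the SIGSEGV, and use bt to see which of your lines (for example, the MPI_Bcast() call) the crash actually comes from.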