I've written a large program and I'm having a really hard time tracking down a segmentation fault. I posted a question about it, but there wasn't enough information in it to go on (see the link below; if you follow it, note that I spent almost an entire day trying, several times, to come up with a minimal compilable version of the code that reproduced the error, to no avail).
https://stackoverflow.com/questions/16025411/phantom-bug-involving-stdvectors-mpi-c
So now I'm trying my hand at valgrind for the first time. I just installed it (simply "sudo apt-get install valgrind"), with no special installation to account for MPI (if there is such a thing). I'm hoping for concrete information, including file names and line numbers (I understand it's impossible for valgrind to provide variable names). While I am getting useful information, including
- Invalid read of size 4
- Conditional jump or move depends on uninitialised value(s)
- Uninitialised value was created by a stack allocation
- 4 bytes in 1 blocks are definitely lost
in addition to this magical thing
- Syscall param sched_setaffinity(mask) points to unaddressable byte(s) at 0x433CE77: syscall (syscall.S:31) Address 0x0 is not stack'd, malloc'd or (recently) free'd
I am not getting file names and line numbers. Instead, I get
==15095== by 0x406909A: ??? (in /usr/lib/openmpi/lib/libopen-rte.so.0.0.0)
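Just to be concrete about the kind of report I mean, a trivial standalone snippet like the following is the sort of code memcheck flags with those "uninitialised value" messages (this is purely illustrative; the values are made up and it is not part of my program):

#include <cstdio>

int main() {
    int x;                    // deliberately never initialised
    if (x > 17)               // memcheck: conditional jump depends on uninitialised value(s)
        std::printf("big\n"); // --track-origins should trace x back to a stack allocation
    return 0;
}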
Here's how I compile my code:
mpic++ -Wall -Wextra -g -O0 -o Hybrid.out (…file names)
Here are two ways I've executed valgrind:
valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log.txt mpirun -np 1 Hybrid.out
and
mpirun -np 1 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out
The second version is based on the instructions in
Segmentation faults occur when I run a parallel program with Open MPI
which, if I'm understanding the chosen answer correctly, appears to be contradicted by
openmpi with valgrind (can I compile with MPI in Ubuntu distro?)
I am deliberately running valgrind on one processor because that's the only way my program will execute to completion without the segmentation fault. I have also run it with two processors, and my program seg faulted as expected, but the log I got back from valgrind seemed to contain essentially the same information. I'm hoping that by resolving the issues valgrind reports on one processor, I'll magically solve the issue happening on more than one.
I tried to include "-static" in the program compilation as suggested in
Valgrind not showing line numbers in spite of -g flag (on Ubuntu 11.10/VirtualBox)
but the compilation failed, saying (in addition to several warnings)
dynamic STT_GNU_IFUNC symbol "strcmp" with pointer equality in '…' can not be used when making an executable; recompile with -fPIE and relink with -pie
I have not looked into what "-fPIE" and "-pie" mean. Also, please note that I am not using a makefile, nor do I currently know how to write one.
A few more notes: my code does not call malloc, calloc, or new. I'm working entirely with std::vector; there are no C arrays. I do use member functions like .resize(), .insert(), .erase(), and .pop_back(). My code also passes vectors to functions by reference and by constant reference. As for MPI calls, I only use MPI_Barrier(), MPI_Bcast(), and MPI_Allgatherv().
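In case it helps, here is roughly the shape of the MPI_Allgatherv pattern I mean, with made-up names (an illustrative sketch, not my actual code). The point of the sketch is that the counts, the displacements, and the size of the receive vector all have to be consistent before the call, since MPI writes through the raw pointer and std::vector will not grow on its own:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each rank contributes a different number of elements.
    std::vector<int> local(rank + 1, rank);
    int mycount = static_cast<int>(local.size());

    // Gather the per-rank counts, then build the displacements.
    std::vector<int> counts(nprocs);
    MPI_Allgather(&mycount, 1, MPI_INT, &counts[0], 1, MPI_INT, MPI_COMM_WORLD);

    std::vector<int> displs(nprocs, 0);
    for (int i = 1; i < nprocs; ++i)
        displs[i] = displs[i - 1] + counts[i - 1];

    // The receive buffer must already have room for every rank's data.
    std::vector<int> all(displs[nprocs - 1] + counts[nprocs - 1]);

    MPI_Allgatherv(&local[0], mycount, MPI_INT,
                   &all[0], &counts[0], &displs[0], MPI_INT,
                   MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

(The sketch uses &vec[0] rather than vec.data() only because data() is C++11; the vectors here are never empty, so the expression is valid.)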
How do I get valgrind to show the file names and line numbers for the errors it is reporting? Thank you all for your help!
EDIT
I kept working on this, and a friend of mine pointed out that the reports without line numbers all come from the MPI libraries, which I did not compile from source; since I didn't compile them, they weren't built with -g, and hence no line numbers show up for them. So I tried valgrind again, based on this command,
mpirun -np 1 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out
but now for two processors, which is
mpirun -np 2 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out
The program ran to completion (I did not see the segfault reported on the command line), but this run of valgrind did give me line numbers within my own files. The line valgrind points to is a line where I call MPI_Bcast(). Is it safe to say that this appeared because the memory problem only manifests itself on multiple processors (since I've run it successfully with -np 1)?
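To make that last question concrete, a broadcast of a std::vector generally has to look something like this sketch (hypothetical names, not my real code). If the vector on the non-root ranks were not sized before the data broadcast, MPI_Bcast would write past the end of its buffer, and as far as I can tell that is exactly the kind of error that can only appear with more than one process:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> data;
    if (rank == 0)
        data.assign(100, 3.14);   // only the root fills the vector

    // Broadcast the size first so every other rank can resize before
    // receiving; MPI_Bcast writes through the raw pointer and will not
    // grow the vector by itself.
    int n = static_cast<int>(data.size());
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank != 0)
        data.resize(n);

    MPI_Bcast(&data[0], n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}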