I am not a C++ programmer, but I am trying to debug some complex code. Not the best preconditions, I know...

So I have an OpenFOAM solver which uses (includes) lots of code, and I am struggling to find the error. I compile with

SOURCE=mySolver.C ; g++ -m64 -Dlinux64 -DWM_DP -Wall -Wextra -Wno-unused-parameter -Wold-style-cast -O3 -DNoRepository -ftemplate-depth-100 -I/opt/software/openfoam/OpenFOAM-2.0.5/src/dynamicMesh/lnInclude {more linking} -I. -fPIC -c $SOURCE -o Make/linux64Gcc46DPOpt/mySolver.o

and after running the solver with the appropriate options, it crashes at the end, after (or during) my return statement:

BEFORE return 0

*** glibc detected *** /opt/software/openfoam/myLibs/applications/bin/linux64Gcc46DPOpt/mySolver: double free or corruption (!prev): 0x000000000d3b7c30 ***
======= Backtrace: =========
/lib64/libc.so.6[0x31c307230f]
/lib64/libc.so.6(cfree+0x4b)[0x31c307276b]
/opt/software/openfoam/ThirdParty-2.0.5/platforms/linux64/gcc-4.5.3/lib64/libstdc++.so.6(_ZNSsD1Ev+0x39)[0x2b34781ffff9]
/opt/software/openfoam/myLibs/applications/bin/linux64Gcc46DPOpt/mySolver(_ZN4Foam6stringD1Ev+0x18)[0x441e2e]
/opt/software/openfoam/myLibs/applications/bin/linux64Gcc46DPOpt/mySolver(_ZN4Foam4wordD2Ev+0x18)[0x442216]
/lib64/libc.so.6(__cxa_finalize+0x8e)[0x31c303368e]
/opt/software/openfoam/myLibs/lib/linux64Gcc46DPOpt/libTMP.so[0x2b347a17f866]
======= Memory map: ========
...

My solver looks like (sorry, I can't post all parts):

#include "stuff1.H"
#include "stuff2.H"

int main(int argc, char *argv[])
{
#include "stuff3.H"
#include "stuffn.H"

    while (runTime.run())
    {

        ...

    }

    Info<< "BEFORE return 0\n" << endl;

    return(0);
}

Running the solver in gdb with the setting set environment MALLOC_CHECK_ 2 yields:

BEFORE return 0

Program received signal SIGABRT, Aborted.
0x00000031c3030265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00000031c3030265 in raise () from /lib64/libc.so.6
#1  0x00000031c3031d10 in abort () from /lib64/libc.so.6
#2  0x00000031c3075ebc in free_check () from /lib64/libc.so.6
#3  0x00000031c30727f1 in free () from /lib64/libc.so.6
#4  0x00002aaab0496ff9 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() ()
   from /opt/software/openfoam/ThirdParty-2.0.5/platforms/linux64/gcc-4.5.3/lib64/libstdc++.so.6
#5  0x0000000000441e2e in Foam::string::~string (this=0x2aaaac0bd3c8, __in_chrg=<value optimized out>) at /opt/software/openfoam/OpenFOAM-2.0.5/src/OpenFOAM/lnInclude/string.H:78
#6  0x0000000000442216 in Foam::word::~word (this=0x2aaaac0bd3c8, __in_chrg=<value optimized out>) at /opt/software/openfoam/OpenFOAM-2.0.5/src/OpenFOAM/lnInclude/word.H:63
#7  0x00000031c303368e in __cxa_finalize () from /lib64/libc.so.6
#8  0x00002aaab2416866 in __do_global_dtors_aux () from /opt/software/openfoam/myLibs/lib/linux64Gcc46DPOpt/libTMP.so
#9  0x0000000000000000 in ?? ()
(gdb) 

How should I proceed to find the real source of my error?

By the way, I saw this and this, which are similar but do not solve my issue. Also, valgrind isn't working correctly for me. I know it has to do with some wrong (de-)allocation, but I don't know how to actually find the problem.

/Edit

I wasn't able to locate my problem yet...

I think the backtrace which I posted above (frame #8) shows that the problem is in the code which compiles to libTMP.so. In the Make/options file I added the options -DFULLDEBUG -g -O0. I thought it would then be possible to track down the bug, but I don't know how.

Any help is highly appreciated!

EverythingRightPlace
  • If you are using new/new[] are you calling delete/delete[]? – NathanOliver Feb 05 '15 at 14:19
  • @Nathan the solver includes a ton of other files. I didn't write the code so I don't want to search by hand. I think there are debuggers made for this purpose. But how can I find the issue (e.g. missing delete statements)? – EverythingRightPlace Feb 05 '15 at 14:21
  • The trace suggests that the corruption occurs in the d'tor of a global object of type `Foam::word`. I would look for places where those are declared, defined or (mis-)used. – eerorika Feb 05 '15 at 14:28
  • Someone is taking the internal buffer of a `std::string` (which is also used by a `Foam::string` and a `Foam::word`) and calling `delete` or `delete[]` on it via a `char*`. `std::string` manages its own buffer, so that `delete[]` is incorrect. When the `std::string` is destroyed, it also tries to clean up its buffer, but finds that someone else did it first. The problem occurred long before the cleanup of the `std::string`, and it notifies you that things have gone wrong. This is good, because if unlucky, worse things can happen *and it wouldn't notice them happening*. – Yakk - Adam Nevraumont Feb 05 '15 at 15:15
  • Compile with `g++ -Wall -Wextra -g` then use [valgrind](http://valgrind.org/) and the `gdb` debugger – Basile Starynkevitch Feb 09 '15 at 13:36
  • How can you `return(0);`? `return` is not a function – ForceBru Feb 09 '15 at 13:40
  • @ForceBru: you can `return` a parenthesised expression (even if the parentheses are useless). – Basile Starynkevitch Feb 09 '15 at 13:42
  • @BasileStarynkevitch I used `gdb` and posted my *results* already. I guess I am not using valgrind correctly. I execute e.g. `gdb --args /path/to/solver -flags` and do the same with `valgrind /path/to/solver -flags`, which for valgrind yields: `--11945-- Warning: DWARF2 CFI reader: unhandled DW_OP_ opcode 0x2a valgrind: m_debuginfo/readdwarf.c:2204 (copy_convert_CfiExpr_tree): Assertion 'srcix >= 0 && srcix < VG_(sizeXA)(srcxa)' failed.` – EverythingRightPlace Feb 09 '15 at 13:42
  • @EverythingRightPlace: you should compile *without* `-O3` and just with `-g` ; BTW, your GCC 4.5 is ancient, please upgrade your GCC compiler (to e.g. 4.9.2 at least) and upgrade and recompile your `openfoam` software – Basile Starynkevitch Feb 09 '15 at 13:43
  • As Yakk suggested, you could try to trace the lifetime of your Foam::word/Foam::string objects which are used in your while() loop. Either there is some "delete" statement or one of the destructors (~ClassName()) is tampering with the strings. – Sorin Totuarez Feb 10 '15 at 15:58
  • @SorinTotuarez There is no Foam::word/Foam::string in the solver code. Maybe it is **somewhere** else (the code is very big and consists of a lot of files). Isn't there a possibility to get a deeper view in the backtrace of gdb. As far as I understand the error occurs in the `libTMP.so` (see backtrace #8). – EverythingRightPlace Feb 11 '15 at 06:58
  • `#include "stuff3.H"` within `main()` is suspicious – M.M Feb 12 '15 at 00:14
  • @MattMcNabb there is just plain code included which itself has no header. Maybe this is not the best style but it shouldn't do any harm (just like pasting code in there). – EverythingRightPlace Feb 16 '15 at 10:55
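
To illustrate the failure mode Yakk describes in the comments above, here is a minimal sketch. It is only an assumption about what might be happening somewhere in the included code, not a diagnosis. The bug pattern is shown as a comment (running it would be undefined behaviour), and `owned_copy` is a hypothetical helper showing one safe alternative: copy the characters into memory that the calling code owns.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// BUG pattern (do not do this): the buffer belongs to the std::string.
//   char* p = const_cast<char*>(s.c_str());
//   delete[] p;   // first free, performed by code that does not own the buffer
//   ...           // later ~std::string() frees the same buffer again, and
//                 // glibc may report "double free or corruption"

// Safe alternative (hypothetical helper): copy the characters, including the
// terminating '\0', into a buffer that is released automatically.
std::unique_ptr<char[]> owned_copy(const std::string& s)
{
    std::unique_ptr<char[]> buf(new char[s.size() + 1]);
    std::memcpy(buf.get(), s.c_str(), s.size() + 1);
    assert(buf[s.size()] == '\0'); // the terminator was copied as well
    return buf;
}
```

The copy can be modified or released freely while the original string keeps sole ownership of its own buffer.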

5 Answers

If you have dealt with all compiler warnings and valgrind errors but the problem remains, then divide and conquer.

Cut out half of the code (use #if directives, remove files from Makefile, or delete lines and restore later using source control).

If the problem goes away then it's likely that it was caused by something you just removed. Or if the problem remains then it's certainly in the code that still remains.

Repeat the procedure recursively until you home in on the problem location.

This doesn't always work because undefined behaviour can manifest itself at a later time than the line which caused it.

However you can work towards producing a minimal program that still has the problem. Eventually you must either produce an actual minimal example that you cannot reduce further, or uncover the true cause.
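
The halving procedure can be sketched as a binary search. This is only an illustration under strong assumptions: the code can be split into an ordered list of units, a single unit is responsible, and the crash is deterministic. The names `find_culprit` and `fails_with_first` are hypothetical; in practice the callback would stand for "rebuild with only the first n units enabled, run, and report whether the crash still appears".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Returns the index of the unit whose inclusion turns a clean run into a crash.
// Assumes: enabling zero units never crashes, enabling all units crashes.
std::size_t find_culprit(std::size_t unit_count,
                         const std::function<bool(std::size_t)>& fails_with_first)
{
    assert(unit_count > 0);
    std::size_t lo = 0;           // the first `lo` units are known to run cleanly
    std::size_t hi = unit_count;  // the first `hi` units are known to crash
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (fails_with_first(mid)) {
            hi = mid;             // the culprit is among the first `mid` units
        } else {
            lo = mid;             // the culprit comes after the first `mid` units
        }
    }
    return hi - 1;                // zero-based index of the offending unit
}
```

With 8 units this needs 3 rebuilds instead of up to 8, which matters when each rebuild of a large solver is slow.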

M.M

If you haven't got anything concrete after using gdb and valgrind, you can try disassembling your .so library with objdump. As you can see, the backtrace gives you the addresses of the errors. I used this approach a long time ago in a project while debugging a problem. After disassembling, match the address of the error to the address of the statement in your library; it might give you an idea of the error location. The command for disassembling is objdump -dR <library.so>

You can find more information about objdump here
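
One step the procedure above leaves implicit: a shared library is placed at a base address chosen at load time, so a runtime address from a backtrace normally has to be reduced by the library's load base (the start of its first mapping in the memory map) before it can be compared with objdump output. A minimal sketch of that arithmetic, assuming the library is linked at base address 0, as is usual for position-independent shared objects. The function names are hypothetical, and the numbers in the note below are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Offset of a runtime address inside a shared library, given the address at
// which the library was loaded (both values are inputs supplied by the user).
std::uint64_t library_offset(std::uint64_t runtime_addr, std::uint64_t load_base)
{
    assert(runtime_addr >= load_base);
    return runtime_addr - load_base;
}

// Format the offset the way objdump prints instruction addresses
// (lowercase hex, no "0x" prefix), so it can be searched for in the listing.
std::string objdump_label(std::uint64_t offset)
{
    std::ostringstream out;
    out << std::hex << offset;
    return out.str();
}
```

For example, with a made-up load base of 0x7f0000000000 and a runtime address of 0x7f0000001234, objdump_label(library_offset(...)) gives 1234, which is the address to look for in the objdump listing.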

Tejendra
  • I like the idea and had a look at the output of **objdump**. The address in `#8 0x00002aaab2416866 in __do_global_dtors_aux () from /opt/software/openfoam/myLibs/lib/linux64Gcc46DPOpt/libTMP.so` doesn't appear in the objdump output. I also searched for do_global_dtors, which gives me *something*: `00000000001463c0 <__do_global_dtors_aux>: 1463c0: 55 push %rbp` and so on ... how to proceed? – EverythingRightPlace Feb 16 '15 at 10:50

valgrind

Ok, I risk being shot down for a one-word answer, but bear with me. Try valgrind. Build the most debug version you have that still has issues and simply issue:

valgrind path/to/program

Chances are, the first reported issue will be your problem source. You can even get valgrind to launch a gdb server and let you attach to debug the code leading to the first memory issue. See:

http://tromey.com/blog/?s=valgrind

Jeffrey

Some other options that were not listed yet are:

You can try gdb execution flow recording capability:

$ gdb target_executable
(gdb) b main
(gdb) run
(gdb) target record-full
(gdb) set record full insn-number-max unlimited

Then, when the program crashes, you will be able to step backward through the execution with the reverse-next and reverse-step commands. Note that the program runs really slowly in this mode.

Another possible approach is to try the clang static analyzer or the clang-check tool on your code. Sometimes the analyzer can give a good hint about where the problem in the code might be.

Also, you can link your code with jemalloc and use its debugging capabilities. The options "opt.junk", "opt.quarantine", "opt.valgrind" and "opt.redzone" can be useful. In general, they make malloc allocate some additional memory that is used to detect writes and reads past the end of buffers, reads of deallocated memory and so on. See the man page. These options are set at startup, for example through the MALLOC_CONF environment variable, and can be read back with the mallctl function.
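
To give an idea of what a red zone does, here is a toy sketch. It is not jemalloc's implementation; the names `guarded_alloc` and `guard_intact`, the guard size and the fill pattern are all arbitrary choices made for illustration. The allocator hands out a little more memory than requested, fills the extra bytes with a known pattern, and later checks whether the pattern has been overwritten.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

const std::size_t kRedzone = 16;      // guard bytes placed after the user region
const unsigned char kPattern = 0xA5;  // arbitrary fill value for the guard bytes

// Allocate n usable bytes followed by a patterned guard region.
// Release the result with std::free.
unsigned char* guarded_alloc(std::size_t n)
{
    unsigned char* p = static_cast<unsigned char*>(std::malloc(n + kRedzone));
    assert(p != 0);
    std::memset(p + n, kPattern, kRedzone);
    return p;
}

// True if no write has run past the end of the n usable bytes.
bool guard_intact(const unsigned char* p, std::size_t n)
{
    for (std::size_t i = 0; i < kRedzone; ++i) {
        if (p[n + i] != kPattern) {
            return false;
        }
    }
    return true;
}
```

A debugging allocator would run a check like guard_intact when the block is freed, so an overflow is reported close to the buffer it damaged rather than much later at program exit.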

One more way to find a bug is to build your code with gcc's or clang's sanitizers enabled. You can turn them on with -fsanitize="sanitizer", where "sanitizer" can be one of: address, thread, leak, undefined. The compiler will instrument the application with additional code that performs extra checks and prints a report. For example:

#include <vector>
#include <iostream>

int main() {
  std::vector<int> vect;
  vect.resize(5);
  std::cout << vect[10] << std::endl; // reads an element past the end of the vector's internal buffer
}

Compile it with the sanitizer turned on and run:

$ clang++ -fsanitize=address test.cpp
$ ./a.out

Gives the output:

==29920==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60400000dff8 at pc 0x0000004bad10 bp 0x7fff16d63e10 sp 0x7fff16d63e08
READ of size 4 at 0x60400000dff8 thread T0
#0 0x4bad0f in main (/home/pablo/a.out+0x4bad0f)
#1 0x7f0b6ce43fdf in __libc_start_main (/lib64/libc.so.6+0x1ffdf)
#2 0x4baaac in _start (/home/pablo/a.out+0x4baaac)

0x60400000dff8 is located 0 bytes to the right of 40-byte region [0x60400000dfd0,0x60400000dff8)
allocated by thread T0 here:
#0 0x435b9b in operator new(unsigned long) (/home/pablo/a.out+0x435b9b)
#1 0x4c1f49 in __gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*) (/home/pablo/a.out+0x4c1f49)
#2 0x4c1d05 in __gnu_cxx::__alloc_traits<std::allocator<int> >::allocate(std::allocator<int>&, unsigned long) (/home/pablo/a.out+0x4c1d05)
#3 0x4bfd51 in std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long) (/home/pablo/a.out+0x4bfd51)
#4 0x4bdb2a in std::vector<int, std::allocator<int> >::_M_fill_insert(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, unsigned long, int const&) (/home/pablo/a.out+0x4bdb2a)
#5 0x4bbe49 in std::vector<int, std::allocator<int> >::insert(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, unsigned long, int const&) (/home/pablo/a.out+0x4bbe49)
#6 0x4bb358 in std::vector<int, std::allocator<int> >::resize(unsigned long, int) (/home/pablo/a.out+0x4bb358)
#7 0x4bacaa in main (/home/pablo/a.out+0x4bacaa)
#8 0x7f0b6ce43fdf in __libc_start_main (/lib64/libc.so.6+0x1ffdf)
Pavel Davydov
  • While using `GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)` there is no `target record-full`. I tried `target record` and afterwards: `set record insn-number-max unlimited` which gives following error: `No symbol "unlimited" in current context.` – EverythingRightPlace Feb 13 '15 at 12:40
  • @EverythingRightPlace You can just do `record full` and then `target record` instead of `target record-full`. Try to set `insn-number-max` to 0, it must be the same as unlimited. – Pavel Davydov Feb 13 '15 at 13:32
  • @EverythingRightPlace Try to update gdb if you can, mine is `GNU gdb (GDB) Fedora 7.8.2-38.fc21` – Pavel Davydov Feb 13 '15 at 13:41

I partially agree with Matt: divide et impera is the way. But I also partially disagree: modifying the code you are trying to debug can put you on the wrong track, even more so if you are trying to debug huge, complex code that is not yours, in a language you have not mastered.

Instead, follow a divide et impera method coupled with a top-down strategy: start by adding a few breakpoints at a higher level of the code, say in main, then launch the program and see which breakpoints get hit and which do not before the crash. Now you have a general idea of where the bug is; remove all breakpoints and add new ones a little bit deeper, in the area you just found, and repeat until you hit the routine that causes the crash.

It can be tedious, I know, but it works and, moreover, it gives you a much better understanding of how the entire system works. I've fixed bugs this way in unfamiliar applications made of tens of thousands of lines of code, and it has always worked; it may take an entire day, but it works.

motoDrizzt
  • But my problem lies in the **.so**, which I can't set breakpoints in. Or how would you proceed for my specific issue (please see the gdb output)? – EverythingRightPlace Feb 16 '15 at 10:10
  • I'd downvote my answer if I could; sorry, I misunderstood, I thought it was a .so from your own project, not from external sources. How to proceed now... it's a bit beyond my capabilities, I fear. Probably, as OpenFOAM is open source, I'd download the source and debug it. – motoDrizzt Feb 16 '15 at 10:44
  • I have the code, but it is very complex and I am not used to C++. So *debugging it* is easier said than done, as I noticed :( – EverythingRightPlace Feb 16 '15 at 10:46
  • @EverythingRightPlace the problem is *detected* in the .so as a double free OR CORRUPTION , but the cause is the first free, or a "stray write to memory" which has occurred earlier. – JulianSymes Feb 16 '15 at 11:26