15

I have a process (restarted by a watchdog whenever it stops for any reason) that usually uses about 200MB of memory. Once I saw it eating up memory, with usage around 1.5-2GB, which definitely means a "memory leak" somewhere. ("Memory leak" is in quotes because it is not a real leak, i.e. allocated memory that is never freed and is unreachable; please note that only smart pointers are used. So I suspect some huge container, which I haven't found, or something similar.)

Later, the process crashed because of the high memory usage, and a core dump of about 2GB was generated. The problem is that I can't reproduce the issue, so valgrind won't help here (I guess). It happens very rarely and I can't "catch" it.

So my question is: is there a way, using the exe and the core file, to locate which part of the process used most of the memory?

I took a look at the core file with gdb and there's nothing unusual. But the core is big, so there must be something. Is there a clever way to understand what happened, or is guessing the only option? (The exe is big: 12 threads, about 50-100 classes, maybe more, etc.)

It's a C++ application, running on RHEL5U3.

Lightness Races in Orbit
Kiril Kirov
  • 1
    Consider trading your coredump tag with 5 followers for the C++ tag with xK? followers. Good luck. – shellter Jan 03 '12 at 13:41
  • I thought about this, but I wondered if it's right. I'll give it a try :) – Kiril Kirov Jan 03 '12 at 14:03
  • @Kiril Kirov: try http://www.outofcore.com/2011/06/scripted-debug-using-gdb/ (I have never tried it myself, so I'm posting it as a comment). – Jayan Jan 05 '12 at 03:53
  • Actually, Valgrind can help. You will be looking for reads from uninitialized variables, read after free, and such. Something like that could be happening often, though it rarely triggers your problem. – rleir Apr 03 '16 at 12:38

4 Answers

12

Open the core dump in a hex viewer (as bytes, words, dwords, or qwords). Starting from the middle of the file, look for any repeating pattern. If you find one, try to determine the starting address and length of a possible data structure. From the length and contents, try to guess what the structure might be. From the address, try to find a pointer to this structure. Repeat until you reach either the stack or a global variable. For a stack variable, you can easily tell in which function the chain starts. For a global variable, you at least know its type.

If you cannot find any pattern in the core dump, chances are that the leaking structure is very large. In that case, compare what you see in the file with the possible contents of all large structures in the program.

Update

If your core dump has a valid call stack, you can start by inspecting its functions. Look for anything unusual. Check that memory allocations near the top of the call stack do not request too much. Check for possible infinite loops in the functions on the call stack.

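The words "only smart pointers are used" frighten me. If a significant share of these smart pointers are shared pointers (shared_ptr, intrusive_ptr, ...), then instead of searching for huge containers, it is worth searching for shared-pointer cycles. Objects in such a cycle keep each other's reference count above zero, so they are never freed even when nothing else refers to them.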
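To illustrate what a shared-pointer cycle looks like, here is a minimal sketch. It assumes C++11 `std::shared_ptr` (on an older toolchain `boost::shared_ptr` behaves the same way); the `Node` type, its `alive` counter, and the function name are hypothetical and not taken from the asker's code.

```cpp
#include <memory>

// Hypothetical node type, used only for illustration.
struct Node {
    std::shared_ptr<Node> next;  // owning link: two of these can form a cycle
    std::weak_ptr<Node> back;    // non-owning link: cannot keep a node alive
    static int alive;            // number of Node objects currently alive
    Node() { ++alive; }
    ~Node() { --alive; }
};
int Node::alive = 0;

// Builds a two-node cycle out of owning links, drops every external handle,
// and returns how many nodes are still alive at that point.
int count_alive_after_shared_cycle() {
    Node* raw = nullptr;
    {
        std::shared_ptr<Node> a = std::make_shared<Node>();
        std::shared_ptr<Node> b = std::make_shared<Node>();
        a->next = b;
        b->next = a;      // cycle: a keeps b alive and b keeps a alive
        raw = a.get();    // remembered only so the demo can clean up afterwards
    }                     // a and b go out of scope, but neither node is destroyed
    int still_alive = Node::alive;

    // Break the cycle by hand so the example itself does not leak.
    std::shared_ptr<Node> held;
    held.swap(raw->next); // raw->next is now empty; held owns the second node
    return still_alive;   // held is released on return, which frees both nodes
}
```

In this sketch the function returns 2: both nodes outlive every external handle. Turning one of the two links into the `weak_ptr` member (`b->back = a`) would let both nodes be destroyed at the end of the inner scope.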

Update 2

Try to determine where the heap ends in the core file (the brk value). Run the process that dumped core under gdb and use the pmap command from another terminal. gdb should also know this value, but I have no idea how to ask it... If most of the process's memory is above brk, you can limit your search to large memory allocations (most likely std::vector), since glibc typically serves large requests through mmap rather than from the brk heap.

To improve the chances of finding leaks in the heap area of an existing core dump, some coding may help (I haven't done this myself; it's just a theory):

  • Read the core file, interpreting each value as a pointer (ignore the code segment, unaligned values, and pointers outside the heap area). Sort the list and calculate the differences between adjacent elements.
  • At this point the whole memory is split into many possible structures. Compute a histogram of structure sizes and drop any insignificant values.
  • For each pointer, calculate the difference between its own address and the address of the structure that contains it. For each structure size, compute a histogram of these pointer displacements, again dropping any insignificant values.
  • Now you have enough information to guess structure types or to construct a directed graph of structures. Find the source nodes and cycles of this graph. You can even visualize the graph, as in "list “cold” memory areas".
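The first two steps of the list above could be sketched roughly as follows. This is an untested-in-production illustration, not a finished tool: it assumes a 64-bit process dumped on a machine with the same byte order as the one running the analysis, it assumes the heap address range is already known, and the function names `collect_heap_pointers` and `gap_histogram` are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Step 1: treat every aligned 8-byte word of the raw image as a candidate
// pointer and keep the values that are 8-byte aligned and fall inside
// [heap_begin, heap_end). The result is sorted and de-duplicated.
std::vector<std::uint64_t> collect_heap_pointers(
        const std::vector<unsigned char>& image,
        std::uint64_t heap_begin, std::uint64_t heap_end) {
    assert(heap_begin <= heap_end);
    std::vector<std::uint64_t> ptrs;
    const std::size_t word = sizeof(std::uint64_t);
    for (std::size_t off = 0; off + word <= image.size(); off += word) {
        std::uint64_t v;
        std::memcpy(&v, image.data() + off, word);  // host byte order assumed
        if (v >= heap_begin && v < heap_end && v % 8 == 0)
            ptrs.push_back(v);
    }
    std::sort(ptrs.begin(), ptrs.end());
    ptrs.erase(std::unique(ptrs.begin(), ptrs.end()), ptrs.end());
    return ptrs;
}

// Step 2: histogram of the gaps between adjacent pointer targets.
// Key = gap in bytes (a guess at a structure size), value = occurrences.
std::map<std::uint64_t, std::size_t> gap_histogram(
        const std::vector<std::uint64_t>& sorted_ptrs) {
    std::map<std::uint64_t, std::size_t> hist;
    for (std::size_t i = 1; i < sorted_ptrs.size(); ++i)
        ++hist[sorted_ptrs[i] - sorted_ptrs[i - 1]];
    return hist;
}
```

A gap size that dominates the histogram is a candidate for the size of the leaked structure (plus allocator overhead). For a 2GB core the file would more sensibly be memory-mapped than read into a vector; the vector is used here only to keep the sketch short.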

The core file is in ELF format. Only the start and size of the data segment are needed from its header. To simplify the process, just read it as a linear file, ignoring the structure.
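If you do want the segment boundaries from the header, a sketch along these lines might work. It assumes a 64-bit ELF core, a Linux system that provides `<elf.h>`, and the same byte order as the analysing host; the `Segment` struct and the `load_segments` function are names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <elf.h>
#include <vector>

// One PT_LOAD entry: where the memory lived in the process and where its
// bytes sit in the core file. (Hypothetical helper type.)
struct Segment {
    Elf64_Addr  vaddr;        // start address in the dumped process
    Elf64_Off   file_offset;  // where the bytes start in the core file
    Elf64_Xword file_size;    // how many bytes were written to the core
    bool        writable;     // heap and data segments are writable
};

// Parses the ELF header and program headers found at the start of `file`
// and returns all PT_LOAD segments. Returns an empty vector on anything
// that does not look like a 64-bit ELF file.
std::vector<Segment> load_segments(const std::vector<unsigned char>& file) {
    std::vector<Segment> out;
    Elf64_Ehdr eh;
    if (file.size() < sizeof eh) return out;
    std::memcpy(&eh, file.data(), sizeof eh);
    if (std::memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) return out;
    if (eh.e_ident[EI_CLASS] != ELFCLASS64) return out;
    if (eh.e_phentsize != sizeof(Elf64_Phdr)) return out;

    for (std::uint64_t i = 0; i < eh.e_phnum; ++i) {
        std::uint64_t pos = eh.e_phoff + i * sizeof(Elf64_Phdr);
        if (pos + sizeof(Elf64_Phdr) > file.size()) break;  // truncated file
        Elf64_Phdr ph;
        std::memcpy(&ph, file.data() + pos, sizeof ph);
        if (ph.p_type != PT_LOAD) continue;                 // skip notes etc.
        out.push_back(Segment{ph.p_vaddr, ph.p_offset, ph.p_filesz,
                              (ph.p_flags & PF_W) != 0});
    }
    return out;
}
```

Only the first few kilobytes of the core need to be read for this, since the program headers normally follow the ELF header directly. The writable segments are the ones worth scanning for heap pointers.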

Community
Evgeny Kluev
  • 2
    Running `strings core` may also provide some clues. – Employed Russian Jan 03 '12 at 15:49
  • Thanks, I'll try with a hex editor and with `strings core`... but the core is HUGE - 2GB :x But thanks for the ideas. About the call stack - yes, there's a good call stack for each thread (11 threads and the project is HUGE), but there's nothing suspicious there :\ About the pointers - yes, again, most of them are shared ptrs, but I've checked them and everything seems good, too. I hate such problems when they are not (easily) reproducible :D Thanks again :) – Kiril Kirov Jan 04 '12 at 17:21
  • Nothing interesting with `strings` - lol, about 550k lines :D – Kiril Kirov Jan 04 '12 at 17:57
  • 1
    @Kiril Kirov: try http://www.outofcore.com/2011/06/scripted-debug-using-gdb/ (I have never tried it myself, so I'm posting it as a comment). – Jayan Jan 05 '12 at 03:52
  • Thanks, I will try this (Update 2). +1 for the effort :) – Kiril Kirov Jan 05 '12 at 13:14
3

Once I saw it's eating up the memory - with memory usage about 1.5-2GB

Quite often this is the end result of an erroneous loop running away. Something like:

size_t size = 1;
void *p = malloc(size);
while (!enough_space(size)) {   /* enough_space() is a placeholder predicate */
  size *= 2;                    /* doubles on every failed check */
  p = realloc(p, size);
}
// now use p to do whatever

If enough_space() erroneously returns false under some conditions, your process will quickly grow to consume all memory available.

only smart pointers are used

Unless you control all the code linked into the process, the above statement is false. The erroneous loop could be inside libc, or in any other library that you don't own.

only guessing may help

That's pretty much it. Evgeny's answer has good starting points to help you guess.

Employed Russian
  • The most interesting part of your code snippet is the infinite loop. It may cause memory leaks in many ways (in both C and C++ code). And it is quite easy to find. – Evgeny Kluev Jan 04 '12 at 09:23
  • No such loops - most of the things are STL. About the pointers - no, I don't control all the source of the process, but it's very unlikely that the problem is in `libc`, or `occi`, or in some other widely used 3rd-party library. Not impossible, of course. Thanks :) – Kiril Kirov Jan 04 '12 at 17:22
2

Normal memory allocators don't keep track of which part of the process allocated a given block of memory; after all, the memory will be freed anyway and the pointers are held by the client code. If the memory has truly leaked (i.e. there are no pointers to it left), you have pretty much lost and are looking at a huge block of unstructured memory.

thiton
  • Hm, I should have mentioned that in my question; I'll edit. The addition is - we use ONLY smart pointers, so if there's allocated memory which is not freed, it will still be reachable, for sure. – Kiril Kirov Jan 03 '12 at 10:09
1

Valgrind will likely find several possible errors, and it is worthwhile to analyse all of them. You need to create a suppression file and pass it with --suppressions=/path/to/file.supp. For each possible error that valgrind flags, either add a clause to the suppression file or change your program.

Your program will run slower under Valgrind, so the timing of events will be different and you can't be sure of seeing your error occur.

There is a GUI for valgrind called Alleyoop, but I have not used it much.

rleir