
I am getting a crash due to a memory leak (but the crash itself happens 7 layers deep, in code that merely walks over a linked list - no allocations there).
It is fairly reproducible, almost on a daily basis, so I can always get a fresh core file. I have spent the last 3-5 days going over the code, pairing allocations with deallocations, but cannot seem to find the culprit, as the legacy C application is huge and full of memcpy/malloc/calloc all over the place. Frankly, one wrong memcpy is all it might take.

I went through the effort of compiling Valgrind locally, looking forward to getting some nice tracing of where it started, but Valgrind just makes the machine inoperable - it has to be restarted manually in the server room, as even ssh cannot be used. We basically lost two days of debugging to Valgrind, so I cannot use it a third time (unless Memcheck could somehow work with core files, perhaps?).

Is there some other tool that could help me analyze the core file for memory leaks? gdb with the print command is not exactly helpful.

To be more specific, some core files are really huge - 1.5 GB (while they should not be over 0.3 GB) - so I was hoping to get a list of the top 2-3 offenders that occupy the most memory (which would give me a direct hint as to where to look next).

Any ideas?

Oh, and as for stability - it can properly handle about a million data requests (sometimes a couple of million) before it crashes, so just putting a breakpoint at the place where it usually crashes is out of the question.

3D Coder
  • Take the production sources, go back to the testing site, run the buggy part under valgrind and simulate your traffic until it crashes. Sorry, no other advice. In the meantime, restart your production system on a regular basis known to keep the system stable. – alk Sep 19 '14 at 19:43

3 Answers


I'd try creating a test set of inputs that brings the system up, runs a number of transactions, and then brings it down in a controlled (i.e. everything should be cleaned up) manner. Run that small suite under valgrind and it should at least give you stuff to chase. If it is an older system, you are likely to have false positives to chase. If you haven't found it by then, you will need to come up with more diverse tests.

BTW, when running the smaller tests, you can limit your process size (ulimit/limit) to avoid the massive memory images and associated system stability issues.

DrC
  • 1+ for "*you can limit your process size*" – alk Sep 19 '14 at 19:46
  • The problem with creating the crash-inducing set of inputs is that the real-time inputs are external. There is a HW layer that could theoretically host a custom process to throw a million records at our process, but it might take a couple of days just to get something up and running there, assuming we could even do it. – 3D Coder Sep 19 '14 at 20:04
  • I'm not saying you have to reproduce the crash. Just create a set of test inputs and see if anything leaks. If timing is a problem, that does make it harder. Tough slog ahead of you, though. – DrC Sep 19 '14 at 20:06
  • Could ulimit/limit be used to avoid the scenario where Valgrind brings the machine down? I just got an email that access to that machine is severely limited (way more complicated than just going to the server room and pushing the button). That might actually force us to emulate the layer that pushes the data towards our backend. If we did that, we could use any of the plenty of other machines that could easily be restarted if Valgrind brought them down. – 3D Coder Sep 19 '14 at 20:22
  • The machine is likely thrashing - too large an active memory space for the available RAM. Using ulimit to limit the process size should cause it to crash before running out of RAM. Just a guess, though - I don't really know enough about the situation. – DrC Sep 19 '14 at 21:44

I think you're mixing up memory leaks and memory corruption.

If you have a memory leak, eventually a call to malloc() should return NULL, and your program should have code to detect and log that. Unfortunately, it's more likely that malloc() will succeed, but using the memory will cause the OS to OOM-kill your process, which is more difficult to debug. Oh well.

If you have memory corruption (possibly via memcpy(), which will not cause a memory leak), a call to any of the C memory allocation routines may cause the C library to detect the heap corruption and abort your application. This usually comes with a diagnostic like "heap corruption detected" or "malloc(): invalid next size" or similar.

The advantage of hunting memory corruption over memory leaks is that an out-of-bounds read/write is unambiguously a bug, while deciding what counts as a leak can be more subtle.
If valgrind is too slow, memory corruption can instead be found using AddressSanitizer, which has much lower overhead.
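For reference, a typical AddressSanitizer build looks like the following (file and binary names are placeholders; on Linux, LeakSanitizer is bundled with ASan and reports leaks when the process exits):

```shell
# Rebuild with AddressSanitizer; -O1 and frame pointers keep
# the reports readable. Names are placeholders, not from the question.
gcc -g -O1 -fsanitize=address -fno-omit-frame-pointer \
    legacy_app.c -o legacy_app

# LeakSanitizer prints a leak report with allocation stacks at exit.
ASAN_OPTIONS=detect_leaks=1 ./legacy_app
```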

EOF
  • I don't think I am mixing those two - but the code does a lot of memcpy all over the place without any checks whatsoever, so it's entirely possible that both leaks and corruption are occurring at the same time. I had never heard of AddressSanitizer, so I am going to take a look at it - thanks a lot for the hint! Valgrind, on that 10+ year old server, is bringing the machine down, so it's useless for me, unless someone knows a way to make sure it does not bring the machine down, as the machine restart is virtually inaccessible (as described above). – 3D Coder Sep 23 '14 at 16:14

As a core file contains the raw memory dump of the process (embedded in an ELF data structure, which you can pretty much ignore here), you might be able to look at the bulk of the data in the core file, and watch out for repeating patterns and for familiar data (like strings). This is described pretty well in https://stackoverflow.com/a/8714719/2148773 .
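As a concrete starting point, counting the most frequent printable strings in the dump often floats the leaked objects to the top (a sketch; "core" is whatever your dump file is called):

```shell
# Most frequent printable strings in the core dump: a structure
# leaked millions of times tends to dominate the top of this list.
strings core | sort | uniq -c | sort -rn | head -20
```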

oliver