4

I have a multithreaded program running which crashes after a day or two. Moreover, the gdb backtrace of the core dump does not lead anywhere: there are no symbols at the point where it crashes.

Now the machine that generates the core file has 3 GiB of physical memory and 5 GiB of swap space, but the core dump that we get is around 25 GiB. Isn't the core dump actually a memory dump? Why is the core dump so large?

And can anyone give me more leads on how to debug in such a situation?

Rituparna Kashyap
  • 1,497
  • 13
  • 19

2 Answers

4

If you are running a 64-bit OS, then you can have file-backed mappings that exceed the amount of available physical memory + swap space many times over.

Since kernel version 2.6.23, Linux provides a mechanism to control what gets included in the core dump file, called core dump filter. The value of the filter is a bit-field manipulated via the /proc/<pid>/coredump_filter file (see core(5) man page):

  • bit 0 (0x01) - anonymous private mappings (e.g. dynamically allocated memory)
  • bit 1 (0x02) - anonymous shared mappings
  • bit 2 (0x04) - file-backed private mappings
  • bit 3 (0x08) - file-backed shared mappings (e.g. shared libraries)
  • bit 4 (0x10) - ELF headers
  • bit 5 (0x20) - private huge pages
  • bit 6 (0x40) - shared huge pages

The default value is 0x33, which corresponds to dumping all anonymous mappings, the ELF headers (but only if the kernel is compiled with CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS), and the private huge pages. Reading from this file returns the hexadecimal value of the current filter. Writing a new hexadecimal value to coredump_filter changes the filter for the particular process; e.g., to enable dumping of all possible mappings one would run:

echo 0x7f > /proc/<pid>/coredump_filter

(where <pid> is the PID of the process)

The value of the core dump filter is inherited in child processes created by fork().

Some Linux distributions might change the filter value for the init process early in the OS boot stage, e.g. to enable dumping the file-backed mappings. This would then affect any process started later.

Hristo Iliev
  • 72,659
  • 12
  • 135
  • 186
  • `cat /proc//coredump_filter` gives 00000003. File-backed private/shared mappings are not enabled, right? Can you give me any link that explains file-backed mappings? Also, I am running on a 64-bit OS – Rituparna Kashyap Jul 31 '12 at 09:19
  • See the `mmap(2)` man page. But probably your program or some library does very large virtual allocations that are never touched, e.g. `malloc(16GiB)` and then only touches a small portion of it. – Hristo Iliev Jul 31 '12 at 10:15
  • It is a stress test, so even if it is not one 16 GB malloc, it might be thousands of 16 MB mallocs. :) Can you tell me this: currently the program is running with VIRT 11.7 GB and RES 2.2 GB. Swap is almost free (253312k used of 5079032k total). So is the rest (11.7 - 2.2) GB some file-backed memory mapping? – Rituparna Kashyap Jul 31 '12 at 11:05
  • 1
    If you do `a = malloc(16MiB); a[0] = 0;` then you'd end up with 16 MiB more to `VIRT` and 4 KiB more to `RES` (assuming page size of 4 KiB). If you do it 1024 times without freeing `a` then your virtual memory usage would increase by 16 GiB while your resident set size would only increase by 4 MiB. The difference is allocated but uncommitted memory. – Hristo Iliev Jul 31 '12 at 11:19
  • What can be the reason for the low swap space usage mentioned in my previous comment? – Rituparna Kashyap Jul 31 '12 at 14:16
  • 3
    I believe I've already explained the reason: non-committed virtual memory uses neither physical memory pages nor swap space. With memory overcommitment heuristics enabled on Linux (the default), you can allocate more than the combined size of the physical memory and the swap, and you will only run into trouble once you actually touch more pages than RAM + swap can provide (in short, you'll have a close encounter with the Linux OOM killer). – Hristo Iliev Jul 31 '12 at 14:39
  • @HristoIliev: I think your explanations of what the different memory 'types' (such as non-committed) represent were very interesting. Do you happen to know some good paper on this topic? I wonder how you learned all of this. :-) – Frerich Raabe Aug 01 '12 at 06:50
  • @FrerichRaabe, [Understanding the Linux Virtual Memory Manager](http://kernel.org/doc/gorman/html/understand/) provides plenty of information on the topic. But be warned - it's very hardcore technical reading, not for the faint of heart. – Hristo Iliev Aug 01 '12 at 13:01
2

A core dump contains more than just the state of the memory of the process. See the answer at https://stackoverflow.com/a/5321564/91757 for examples of other information included in the core dump (on Linux).

Frerich Raabe
  • 90,689
  • 19
  • 115
  • 207
  • In particular, it includes the contents of any files [mapped](http://www.kernel.org/doc/man-pages/online/pages/man2/mmap.2.html) into the process's address space. That'll be at least the executable and all the shared libraries it uses, and the application may have mapped other files as well. – Wyzard Jul 31 '12 at 07:23
  • But 17 GiB more... isn't that just a bit too much more? – Hristo Iliev Jul 31 '12 at 07:29
  • Maybe it has some large files mapped. Check `/proc/${pid}/maps` while the program is running. – Wyzard Aug 01 '12 at 00:46