
As far as I know, there are three reasons why a std::bad_alloc can be thrown:

  1. The process requests more memory than can be served
  2. The address space is too fragmented to serve a request for a large chunk of contiguous memory
  3. The heap management data structure is corrupted

We have code which runs into a std::bad_alloc, but none of the above reasons seem to apply. The data structure is a graph stored as a std::list of vertices, where each vertex in turn stores a std::list of the edges it is part of, as well as a contiguous block of data.
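For illustration, a stripped-down sketch of the layout (the identifiers here are invented; the real classes carry additional members):

#include <list>
#include <vector>

// Illustrative only -- names are made up and the real classes
// have more members (pointers, small vectors, etc.).
struct Vertex;

struct Edge {
    Vertex* first;
    Vertex* second;
};

struct Vertex {
    std::list<Edge*>    edges;  // the edges this vertex is part of
    std::vector<double> data;   // contiguous per-vertex data block
};

struct Graph {
    std::list<Vertex> vertices; // one list node allocation per vertex
};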

For small graphs (<= 100'000 vertices), the program runs perfectly fine, irrespective of how large the per-vertex data sections are (we can allocate up to 40 GB in total without problems). If the number of vertices grows larger, however, a std::bad_alloc exception is thrown even on instances using "only" 8 GB of memory.

Since there are no problems when allocating more memory in larger blocks, reasons 1 and 2 above should be ruled out. There are sections where we play around with pointers in a fairly error-prone way, so it is possible that we corrupt the heap data structure. But when run on smaller instances, valgrind's memcheck reports our code as flawless, so that reason seems unlikely as well (on throwing instances, valgrind itself runs out of memory, so we cannot check that case directly).
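For reference, the memcheck runs are of roughly this form (the binary and instance names are placeholders, and the exact flags in our setup may differ):

valgrind --tool=memcheck --leak-check=full ./our_program small_instance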

Does anyone have ideas about what else could be the reason for this behaviour, or what tests we could run to pin down the problem further?

OS: Fedora 19
Build system: cmake with gcc 4.8.2

gTcV
  • Don't forget the 16-20 bytes a memory manager typically adds to each allocation in your calculations. – Billy ONeal Jan 30 '14 at 17:40
  • Technically there is only one reason why `bad_alloc` could be thrown: failure to allocate memory. – pmr Jan 30 '14 at 17:41
  • Implement your own operator new in terms of malloc, link against a suitable debugging malloc library, fire up gdb and stock up on elbow grease. – n. m. could be an AI Jan 30 '14 at 17:46
  • You might want to watch for signed sizes – Dieter Lücking Jan 30 '14 at 17:55
  • @pmr Or undefined behavior somewhere in your code. – James Kanze Jan 30 '14 at 18:52
  • Do your larger graphs possibly have loops? How big is your vertex? I assume this is a 64-bit executable? Which library implementation are you using? Does the constructor for a vertex do additional allocations? Could one of those be out of whack? Have you tried any allocation-tracking tools? – Adrian McCarthy Jan 30 '14 at 19:29
  • @BillyONeal: The reported memory footprints are what htop reports me, so no calculations involved. – gTcV Jan 31 '14 at 07:31
  • @n.m.: Any suggestions for a debugging malloc library? Also, what should I expect to happen if I do what you proposed? – gTcV Jan 31 '14 at 07:45
  • @DieterLücking: I used std::size_t for all my size types. Also, depending on the input the bad_alloc gets thrown at different places in the code, which indicates that it really is an allocation problem and not a bug in the code. – gTcV Jan 31 '14 at 07:45
  • @AdrianMcCarthy: No, the graph is a tree. The vertex contains the std::list of the edges, three pointers, four std::vectors of pointers/integers of length <= 3 as well as a variably sized field of doubles. The size of the latter can range from a single number to a few million numbers, and the bad_alloc occurs on both extremes of the data field size (although the tree needs to be larger for smaller data fields). The std library implementation is GNU ISO C++ Library, but I could not figure out which version (any help on how to get that?). As I said, I tried Valgrind's memcheck. – gTcV Jan 31 '14 at 08:16
  • `dmalloc` is one such library. You can also implement your own by wrapping system malloc. You can then put a breakpoint inside your operator new when it's about to throw an exception. When it gets hit, run heap checks to make sure the heap is not corrupted, and analyse the amount of memory consumed and its fragmentation. – n. m. could be an AI Jan 31 '14 at 11:05
  • Thanks for the hint. I tried using dmalloc, so I downloaded, compiled, ran tests and boom, segmentation fault. Running it under gdb hints that it gets trapped in an infinite recursion. I have read a bit about dmalloc, and I am not sure whether it can do much more than what valgrind's memcheck does. Therefore, I am unsure whether it is worth the trouble of making dmalloc work. – gTcV Feb 01 '14 at 11:09
  • How can I do a heap check myself, given that I don't know how the OS/library manages the heap? Also, I stopped the executable right before it exits and looked at the memory map at /proc/[pid]/maps. I was surprised to find that the executable doesn't just have one memory region allocated to the heap, but rather 34'000 different ones. Doing the same thing not on the compute server but my local machine, I get just one large heap. My feelings tell me that this could cause troubles, but I know too little about memory management to tell for sure. – gTcV Feb 01 '14 at 11:15
  • With dmalloc, you enable the check-heap debugging token. By default, this will cause it to traverse and validate the internal heap data structures every time an allocation is made. This can be slow, so there's an option to do it periodically instead, every Nth allocation. – n. m. could be an AI Feb 01 '14 at 19:11
  • If you want to wrap system malloc instead, when you get a request to allocate N bytes, get `N + 2*sizeof(void*) + 2 * k` from the system malloc, for some k (perhaps k==8 or k==16). Use two pointers to build a doubly linked list of allocated regions. Use `2*k` extra bytes to pad the region at the beginning and at the end. Write some "signature" (a bit pattern you choose) to the padded space. Now you can at any time walk the linked list and check if the signatures are intact. If not, you have a buffer overrun or a similar error somewhere (a sketch of this approach follows after the comments). – n. m. could be an AI Feb 01 '14 at 19:16
  • The fence post check is also done by memcheck, so for that I should not have to try anything new. Only the heap consistency checking would be useful, but at the moment I don't feel like fixing yet another broken program. Thanks for your help anyway, though! Thanks to you, I've learned that malloc implementations are exchangeable, and once I find the time I might try another implementation than the compiler default one and see whether the problem persists. – gTcV Feb 03 '14 at 17:07
  • maybe GPU driver problem? – BЈовић Feb 10 '14 at 09:40
  • Hardly. I don't use GPUs and don't even know whether there are any present on the system I use. It is a compute server without (to my knowledge) any graphical front end. – gTcV Feb 10 '14 at 10:36
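Following up on the wrapping suggestion above, here is a rough sketch of what such a malloc-wrapping operator new with a doubly linked list of blocks and fence-post signatures could look like. It is illustrative only, not code from the question: all names (Header, check_heap, kPad, kFence) are invented, thread safety is ignored, and operator new[]/delete[] would need the same treatment.

#include <cstdlib>
#include <cstring>
#include <cstdio>
#include <new>

namespace {

const std::size_t   kPad   = 16;    // fence size on each side of the user data
const unsigned char kFence = 0xAB;  // fence byte pattern

struct Header {
    Header*     prev;
    Header*     next;
    std::size_t size;               // user-requested size
    std::size_t pad_;               // keeps user data 16-byte aligned on 64-bit
};

Header* g_head = 0;                 // doubly linked list of live allocations

unsigned char* user_ptr(Header* h) {
    return reinterpret_cast<unsigned char*>(h + 1) + kPad;
}

} // namespace

void* operator new(std::size_t n) {
    void* raw = std::malloc(sizeof(Header) + 2 * kPad + n);
    if (!raw) throw std::bad_alloc();       // put a gdb breakpoint here

    Header* h = static_cast<Header*>(raw);
    h->size = n;
    h->pad_ = 0;
    h->prev = 0;
    h->next = g_head;
    if (g_head) g_head->prev = h;
    g_head = h;

    std::memset(user_ptr(h) - kPad, kFence, kPad);  // front fence
    std::memset(user_ptr(h) + n,    kFence, kPad);  // back fence
    return user_ptr(h);
}

void operator delete(void* p) noexcept {
    if (!p) return;
    Header* h = reinterpret_cast<Header*>(
        static_cast<unsigned char*>(p) - kPad - sizeof(Header));
    if (h->prev) h->prev->next = h->next; else g_head = h->next;
    if (h->next) h->next->prev = h->prev;
    std::free(h);
}

// Walk all live allocations and verify both fences of each block.
bool check_heap() {
    for (Header* h = g_head; h; h = h->next) {
        const unsigned char* front = user_ptr(h) - kPad;
        const unsigned char* back  = user_ptr(h) + h->size;
        for (std::size_t i = 0; i < kPad; ++i) {
            if (front[i] != kFence || back[i] != kFence) {
                std::fprintf(stderr, "corrupt block of %zu bytes at %p\n",
                             h->size, static_cast<void*>(user_ptr(h)));
                return false;
            }
        }
    }
    return true;
}

check_heap() can then be called periodically, or from a breakpoint just before operator new throws, to see whether any fence has been overwritten.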

1 Answer


I cannot comment on your post, so I'll put this in an answer.

I came across the same problem while using OpenFST with Kaldi (same system and gcc as yours). I didn't track down the exact origin of this issue, but it seems that kernel 3.12 is the culprit. I booted with one of the backup kernels (from the 3.11 series) and the problem was gone.

You can use:

yum list --showduplicates kernel

to find an available 3.11 kernel.

EDIT:

It seems that this bug is fixed in kernel 3.12.11-201 and in 3.13+.

Source: Bugzilla

tvar0g
  • We ran our code on a different machine with the 3.10.11 kernel and it worked! Thanks a lot for your hint, we probably would never have come up with the idea that it might be a kernel bug! – gTcV Feb 11 '14 at 14:01
  • glad I helped, it took me some time to get there too ;) – tvar0g Feb 11 '14 at 14:28