12

I have an image compression application that now has two different memory allocation systems. The original version uses malloc everywhere; the second uses a simple pool allocator I implemented, which just grabs one large chunk of memory up front and hands out parts of it in response to myalloc() calls.
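
(For context, the pool allocator itself is not shown here; the following is only a minimal sketch of the idea behind myalloc(). The bump-pointer strategy, the helper names pool_init/pool_destroy and the 8-byte rounding are illustrative assumptions, not our actual implementation.)

#include <stddef.h>
#include <stdlib.h>

/* One big malloc up front; myalloc() hands out aligned slices of it.
   No per-allocation header and no individual free, which is where the
   space saving over malloc comes from. */
static unsigned char *pool;   /* start of the single large chunk */
static size_t pool_size;      /* total bytes reserved            */
static size_t pool_used;      /* bytes handed out so far         */

int pool_init(size_t size)
{
  pool = malloc(size);
  pool_size = size;
  pool_used = 0;
  return pool != NULL;
}

void *myalloc(size_t n)
{
  n = (n + 7) & ~(size_t)7;          /* keep returned pointers aligned */
  if (pool == NULL || pool_used + n > pool_size)
    return NULL;                     /* pool exhausted */
  void *p = pool + pool_used;
  pool_used += n;
  return p;
}

void pool_destroy(void)
{
  free(pool);
  pool = NULL;
  pool_size = pool_used = 0;
}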

We've been noticing a huge memory overhead when malloc is used: at the peak of its memory usage, the malloc() version requires about 170 megabytes for a 1920x1080x16bpp image, while the pool allocator allocates just 48 megabytes, of which 47 are actually used by the program.

In terms of allocation patterns, the program allocates a lot of 8-byte blocks (most), 32-byte blocks (many) and 1080-byte blocks (some) with the test image. Apart from these, there are no dynamic memory allocations in the code.

The OS of the testing system is Windows 7 (64 Bit).

How did we test memory usage?

With the custom allocator, we could see how much memory is used because all malloc calls are deferred to the allocator. With malloc(), in Debug mode we just stepped through the code and watched the memory usage in Task Manager. In Release mode we did the same, but less fine-grained, because the compiler optimizes a lot of stuff away so we couldn't step through the code piece by piece (the memory difference between Release and Debug was about 20 MB, which I would attribute to optimization and lack of debug information in Release mode).
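
(As an aside, memory usage can also be sampled programmatically, which is less coarse than watching Task Manager while stepping. The helper below is only a sketch using the documented GetProcessMemoryInfo() API; the name print_memory_usage is illustrative and this is not the code behind the numbers above.)

#include <windows.h>
#include <psapi.h>   /* link with psapi.lib */
#include <stdio.h>

/* Prints committed (pagefile-backed) memory and working set of the
   current process in megabytes, e.g. right before/after compressing
   an image. */
static void print_memory_usage(const char *label)
{
  PROCESS_MEMORY_COUNTERS pmc;
  if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
  {
    printf("%s: commit %lu MB, working set %lu MB\n", label,
           (unsigned long)(pmc.PagefileUsage / (1024 * 1024)),
           (unsigned long)(pmc.WorkingSetSize / (1024 * 1024)));
  }
}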

Could malloc alone be the cause of such a huge overhead? If so, what exactly causes this overhead inside malloc?

TravisG
  • Normally you only write a custom version of something generic when knowing the specifics of how you plan to use it gives you a big performance benefit. However, I wouldn't expect a memory overhead for using malloc. Are you sure you are measuring memory use correctly? Are you sure you are freeing the memory correctly when you use malloc? – CashCow Oct 25 '12 at 08:51
  • Everything in the malloc code is freed (I tested this with a memory profiler), but only at the very end, just before the application finishes executing, so the measurements occur before any free() calls (in both versions). The custom allocator speeds the entire thing up and saves us about 15ms per image (since it's just one large allocation instead of lots of small ones). – TravisG Oct 25 '12 at 08:57
  • A 200% overhead for `malloc` does seem excessive, unless there are a lot more of the 8 byte allocations than you think. – Steve Jessop Oct 25 '12 at 08:57
  • malloc may well "retain" some memory for immediate reuse after you've freed it. If your implementation is using std::vector you can "reserve" ahead, although when you are going to allocate such large amounts of memory, it is better not to go for a model that requires a contiguous buffer. – CashCow Oct 25 '12 at 09:00
  • Can you describe how you guys measured the memory usage? – UmNyobe Oct 25 '12 at 09:04
  • @UmNyobe With the custom allocator, we could see how much memory is used because all malloc calls are deferred to the allocator. With malloc(), in Debug mode we just stepped through the code and watched the memory usage in Task Manager. In Release mode we did the same, but less fine-grained, because the compiler optimizes a lot of stuff away so we couldn't step through the code piece by piece (the memory difference between Release and Debug was about 20 MB, which I would attribute to optimization and lack of debug information in Release mode). – TravisG Oct 25 '12 at 09:06
  • Thanks. Edit the question to include this information... – UmNyobe Oct 25 '12 at 09:08
  • Good idea! Edited it in. – TravisG Oct 25 '12 at 09:12

3 Answers

8

On Windows 7 you always get the low-fragmentation heap allocator, without having to ask for it explicitly by calling HeapSetInformation(). That allocator sacrifices virtual memory space to reduce fragmentation. Your program is not actually using 170 megabytes; you are just seeing a bunch of free blocks lying around, waiting for an allocation of a similar size.
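
For reference, explicitly opting into the low-fragmentation heap looks roughly like this (a sketch only; on Windows 7 the call is normally unnecessary because the LFH is already the default, and it typically fails while the debug heap is active; the function name enable_lfh is illustrative):

#include <windows.h>

/* Ask for the low-fragmentation heap on the process heap.
   On Windows 7 it is already the default; the call mainly matters
   on older systems. */
static void enable_lfh(void)
{
  ULONG lfh = 2;   /* 2 = enable the low-fragmentation heap */
  HeapSetInformation(GetProcessHeap(), HeapCompatibilityInformation,
                     &lfh, sizeof(lfh));
}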

This algorithm is very easy to beat with a custom allocator that doesn't do anything to reduce fragmentation. That may well work out for you, although you won't see the side effects until you keep the program running longer than a single debug session. You do need to make sure it stays stable for days or weeks if that is the expected usage pattern.

The best thing to do is simply not to fret about it; 170 MB is rather small potatoes. And do keep in mind that this is virtual memory, so it doesn't cost anything.

Hans Passant
  • "On Windows 7 you will always get the low-fragmentation heap allocator" You will, unless you are running under a debugger. Based on the methodology described (stepping through the code), it seems the debugger was used when measuring the memory consumption, therefore the LFH was not used. See my answer. – Suma Oct 25 '12 at 09:46
  • That article quotes ancient history; the memory manager was significantly revised in Vista. But yes, the block padding used by the heap debugging feature can affect the amount of VM he sees getting used. – Hans Passant Oct 25 '12 at 09:58
  • Virtual memory costs something; page misses are painful for performance. – Fingolfin Oct 25 '12 at 09:58
  • No, you can't get a page miss from accessing a free block. That would be a bug. – Hans Passant Oct 25 '12 at 10:00
  • @Hans Maybe I got it wrong then, can you please explain this further? – Fingolfin Oct 25 '12 at 10:01
  • Virtual memory is allocated in pages of at least 4k, unless we are talking about segmented memory management without page tables, which probably isn't supported by any OS today. – Aki Suihkonen Oct 25 '12 at 16:38
6

First of all, malloc aligns the returned pointers to 16-byte boundaries. Furthermore, it stores at least one pointer (or the allocated length) in the addresses preceding the returned value. Then it probably adds a magic value or a release counter to indicate that the linked list is not broken or that the memory block has not been released twice (free asserts on double frees).

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
  int *foo = malloc(4);
  int *bar = malloc(4);
  /* the distance between two consecutive tiny allocations shows the
     per-allocation overhead (header plus alignment padding) */
  printf("%d\n", (int)((char *)bar - (char *)foo));
  free(foo);
  free(bar);
  return 0;
}

Output: 32
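
For a rough sense of scale against the question's numbers (assuming, purely for illustration, something on the order of one small allocation per pixel): a 1920x1080 image has about 2 million pixels, and if each 8-byte request actually consumes 32-64 bytes of heap (more with the debug heap's padding), that bookkeeping alone amounts to roughly 60-130 MB, which is in the same ballpark as the ~120 MB difference reported in the question.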

Aki Suihkonen
  • You mean 32 bits takes 16 bytes? – TravisG Oct 25 '12 at 09:11
  • @Aki he says many 32-byte allocations. Why do you come up with 32-bit allocations? – UmNyobe Oct 25 '12 at 09:15
  • This was tested on 64-bit Ubuntu. On 32-bit systems it most likely takes 16 bytes. – Aki Suihkonen Oct 25 '12 at 09:16
  • @UmNyobe: apparently my mistake. Anyway, an 8-byte malloc takes 32 bytes and a **32-byte** malloc then takes at least 48 bytes. – Aki Suihkonen Oct 25 '12 at 09:20
  • Thanks. The alignment might explain a lot of the memory loss. Aki, do you have any sources for this information? – TravisG Oct 25 '12 at 09:21
  • http://www.gnu.org/software/libc/manual/html_node/Aligned-Memory-Blocks.html. Beyond that it depends on the implementation of malloc: http://bardofschool.blogspot.com/2009/09/glibc-malloc-implementation.html discusses that because malloc has to avoid fragmentation and has to be fast at finding free blocks of the proper size, it might implement several mechanisms for differently sized allocation units. – Aki Suihkonen Oct 25 '12 at 09:35
  • I'll accept this since it was the cause of our problem. The initial implementor of the prototype code I'm using right now allocated a lot of 8-byte blocks inside a loop; this could be rewritten to allocate larger chunks in one batch (a sketch of that idea follows these comments), which in turn reduced malloc's memory usage to just 10 MB or so more than my custom allocator. – TravisG Oct 25 '12 at 10:10
  • A `malloc` implementation certainly doesn't have to add headers. Modern allocator implementations like jemalloc have very low metadata overhead even for small allocations (~2%). By using aligned chunks, small (< chunk) allocations can be distinguished from large (>= chunk) ones via an alignment check (4M chunks in jemalloc). The metadata for small allocations can be found via an offset, with a global data structure used for large ones. Alternatively, a global data structure can be used, which is what TCMalloc does. – strcat Jan 01 '15 at 17:49
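
A minimal sketch of the batching idea from the accepted comment above (the Node type, the per-pixel count and the loop structure are illustrative assumptions, not the actual prototype code): instead of one malloc per 8-byte block, allocate the whole batch at once.

#include <stdlib.h>

/* Illustrative 8-byte element; the real type in the prototype differs. */
typedef struct { float a, b; } Node;

int main(void)
{
  size_t count = 1920 * 1080;   /* e.g. one node per pixel */

  /* Before: one tiny malloc(sizeof(Node)) per node inside a loop,
     each paying header and alignment overhead in the heap. */

  /* After: one large allocation for the whole batch, so the heap
     overhead is paid once instead of 'count' times. */
  Node *nodes = malloc(count * sizeof(Node));
  if (nodes == NULL)
    return 1;

  /* ... use nodes[0] .. nodes[count - 1] ... */

  free(nodes);
  return 0;
}
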
4

Caution: when you run your program in Visual Studio or with any debugger attached, the malloc behaviour is changed a lot by default: the Low Fragmentation Heap is not used and the memory overhead may not be representative of real usage (see also https://stackoverflow.com/a/3768820/16673). You need to set the environment variable _NO_DEBUG_HEAP=1 to avoid being hit by this, or measure the memory usage when not running under a debugger.
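
In Visual Studio this can be set per project in the debugging settings (an environment entry such as _NO_DEBUG_HEAP=1; the exact dialog varies between versions), or in the console session from which the program is started.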

Suma