16

I am working on an external sorting algorithm that uses std::queue and must carefully constrain its memory usage. I have noticed that during the merge phase (which uses several std::queues of fixed length), my memory usage increases to about 2.5X what I expected. Since std::queue by default uses std::deque as its underlying container, I ran some tests on std::deque to determine its memory overhead. Here are the results, running on VC++ 9, in release mode, with a 64-bit process:

When adding 100,000,000 chars to a std::deque, the memory usage grows to 252,216K. Note that 100M chars (1 byte) should occupy 97,656K, so this is an overhead of 154,560K.

I repeated the test with doubles (8 bytes) and saw memory grow to 1,976,676K, while 100M doubles should occupy 781,250K, for an overhead of 1,195,426K!!

Now I understand that std::deque is normally implemented as a linked list of "chunks." If this is true, then why is the overhead proportional to the element size (because of course the pointer size should be fixed at 8 bytes)? And why is it so danged huge?

Can anybody shed some light on why std::deque uses so much danged memory? I'm thinking I should switch the underlying containers of my std::queues to std::vector, as there is no overhead there (given that the appropriate size is reserved). I'm thinking the benefits of std::deque are largely negated by the fact that it has such a huge overhead (resulting in cache misses, page faults, etc.), and that the cost of copying std::vector elements may be less, given that the overall memory usage is so much lower. Is this just a bad implementation of std::deque by Microsoft?

Tabber33
  • First question: how did you determine memory usage? Some methods are not as accurate as others. – Martin York Nov 03 '10 at 16:15
  • @Martin, I'm just observing the "Peak Working Set" value for the process in Task Manager. – Tabber33 Nov 03 '10 at 16:19
  • If you write a program that allocates 2M of data (in chunks), then releases it all and waits for user input before exiting, does it exhibit the same behavior? i.e. memory ramps up to a steady state but does not go down. PS: I cannot find "Peak Working Set". – Martin York Nov 03 '10 at 17:22
  • @Martin, I just tested this. It ramps up, then I pause for user input. Then I free it (by using an extra block scope). Then I pause for user input again, and the memory drops to a few K, as expected. BTW, "Peak Working Set" is not shown by default. You have to show that column in Task Manager (I'm on Windows 7). – Tabber33 Nov 03 '10 at 17:34
  • OK. I don't know enough about the new task manager to really comment. But I am not yet convinced that you are seeing what you think you are seeing. Memory allocated to a process does not equal the memory being used by the process (a free/delete does not return that memory to the OS; it is retained by the runtime memory manager so that it can be used in subsequent memory requests). It is this memory, I believe (don't know for sure), that Task Manager is showing. – Martin York Nov 03 '10 at 18:34
  • 1
    I think I have enough of a handle on Windows' virtual memory system to say that when memory is freed (`delete`d in C++), it is indeed returned to the OS. In Windows, memory can be "reserved" and "committed." The "peak working set" is showing the largest amount of physical RAM used by the process over its lifetime, which equals the "committed" virtual memory when the amount is not large enough to require the swap file. I have confirmed that `delete`ing memory causes the memory to be released to the OS, and the working set size decreases accordingly. In my simple test, ... – Tabber33 Nov 03 '10 at 19:00
  • 1
    ... there is no freeing of memory. The memory usage increases steadily in a loop. Unlike `std::vector`, which will indeed allocate a new contiguous block *larger* than the existing one (and this is reflected by a difference between the "peak working set" and the "working set"), copy the elements, and then free the original, this does not happen with a `std::deque`. It grows in 16-byte chunks (as determined by the answers below) and never deallocates memory during inserts. – Tabber33 Nov 03 '10 at 19:02

3 Answers

15

Look at the code for _DEQUESIZ (number of elements per block):

#define _DEQUESIZ   (sizeof (_Ty) <= 1 ? 16 \
    : sizeof (_Ty) <= 2 ? 8 \
    : sizeof (_Ty) <= 4 ? 4 \
    : sizeof (_Ty) <= 8 ? 2 : 1)    /* elements per block (a power of 2) */

The number of elements per block gets smaller as the element gets larger, so the block's payload stays at 16 bytes for any element type up to 8 bytes. Only for elements larger than 8 bytes will you get the expected behavior (overhead that shrinks, as a percentage, as the element size increases).
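To see why that leads to the numbers in the question, here is a rough back-of-the-envelope sketch. The per-allocation heap overhead and the per-block map pointer size below are assumed values chosen for illustration (typical 64-bit figures), not values taken from the VC++ 9 runtime:

#include <cstdio>

int main()
{
    // Rough footprint estimate for a deque built from 16-byte blocks.
    // heapOverhead and mapPointer are assumed values, not measured ones.
    const double count         = 100000000.0; // elements pushed
    const double elemSize      = 1.0;         // 1 for char, 8 for double
    const double elemsPerBlock = 16.0 / elemSize; // mirrors _DEQUESIZ for types <= 8 bytes
    const double heapOverhead  = 24.0;        // assumed allocator bookkeeping per block (64-bit)
    const double mapPointer    = 8.0;         // assumed pointer per block in the deque's map

    const double blocks = count / elemsPerBlock;
    const double bytes  = blocks * (16.0 + heapOverhead + mapPointer);
    std::printf("~%.0f blocks, estimated footprint ~%.0f MB\n",
                blocks, bytes / (1024.0 * 1024.0));
    return 0;
}

With these assumed constants you get on the order of a few hundred MB for 100M chars and roughly 2 GB for 100M doubles, the same ballpark as the question's measurements. The key point is that doubles need eight times as many 16-byte blocks as chars for the same element count, which lines up with the roughly 8x larger overhead observed.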

Dialecticus
  • @Dialecticus, very interesting! But why is the overhead proportion the same for 1-byte chars as it is for 8-byte doubles? – Tabber33 Nov 03 '10 at 16:24
  • Each block has constant overhead (most likely). In both cases the size of the block is constant as well (16 bytes), so the overhead relative to the data is constant too. – Dialecticus Nov 03 '10 at 16:40
  • 4
    Is this Microsoft's code? Why would they try to use blocks that are only 16 bytes? That's crazy! It would certainly explain why the overhead is extremely bad! Typical memory managers have 8-16 bytes of overhead per block allocated; debug builds and 64-bit processes are worse. Plus, so many allocations (and deallocations) could be slow. – Qwertie Nov 03 '10 at 16:50
  • 2
    It is Dinkumware's code -- the implementation of Microsoft's C++ Standard Library. – Tabber33 Nov 03 '10 at 16:54
  • 4
    @Qwertie: I also like the fact that instead of using a `static size_t const` member of `deque` they prefer to compute the thing over and over and rely on the compiler to optimize the computation. It certainly will, but this still seems like an abuse of macros... – Matthieu M. Nov 04 '10 at 16:16
  • 1
    Plus, I forgot to mention the 4+ bytes of overhead you need to store a pointer to each block. By the way, it is possible for a memory manager to achieve small overhead (0-4 bytes) when the user allocates a lot of small same-size blocks... I know because I wrote a memory manager specifically for this purpose. But I wouldn't assume Microsoft provides a highly efficient memory manager, and I hear that debug builds have 36 bytes overhead per block AND the total size is rounded up to the next 16 bytes; see http://www.nobugs.org/developer/win32/debug_crt_heap.html – Qwertie Nov 04 '10 at 16:43
  • @Qwertie - you should enable LFH on CRT heap for best performance – Steve Townsend Nov 04 '10 at 17:40
  • 1
    How does the "Low-fragmentation Heap" affect the bytes-per-allocation overhead? – Qwertie Nov 04 '10 at 20:31
3

Is it possible that you are running Debug binaries? 252MB for 100M chars does seem like a lot...

You can check the attribution of this memory using umdh to take a snapshot before and after, then comparing the two - that might shed some light on why it's larger than you expected.

EDIT: FYI - When I run this outside the debugger on VS2010 I get 181MB with chars.

#include <deque>

int main()
{
    std::deque<char> mydequeue;
    for (size_t i = 0; i < 100 * 1024 * 1024; ++i)
    {
        mydequeue.push_back(char(i));
    }
}

EDIT: Supporting the other answer from @Dialecticus, this gives me the same footprint as double:

struct twoInt64s
{
public:
    twoInt64s(__int64 _a, __int64 _b) : a(_a), b(_b) {}

    __int64 a;
    __int64 b;
};

EDIT: With _DEQUESIZ modified as shown below (128 chars per block), 100M chars now take 113M of memory.

My conclusion is that the overhead you saw is due to the management structures for the deque blocks (each holding only 16 chars of data), plus control info for the deque itself, plus more control info for the heap manager.

#define _DEQUESIZ   (sizeof (value_type) <= 1 ? 128 \
    : sizeof (value_type) <= 2 ? 8 \
    : sizeof (value_type) <= 4 ? 4 \
    : sizeof (value_type) <= 8 ? 2 \
    : 1)    /* elements per block (a power of 2) */

Moral - if you really want to optimize this for your special purpose, be prepared to play with <deque>. Its behaviour depends critically on the size of your elements, and beyond that on the expected usage pattern.

EDIT: Depending on your knowledge of queue sizes, you might be able to drop in boost::circular_buffer as a replacement for the std::queue container. I bet this would perform more like you want (and expected).
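For example (a minimal sketch, assuming Boost is available; the capacity of 1000 is just an illustrative placeholder, size it to whatever your merge phase actually needs). boost::circular_buffer provides the front()/back()/push_back()/pop_front() interface that std::queue requires, so it can be supplied as the second template argument:

#include <queue>
#include <boost/circular_buffer.hpp>

int main()
{
    // Fixed-capacity backing store: one up-front allocation, no per-node overhead.
    boost::circular_buffer<double> backing(1000);
    std::queue<double, boost::circular_buffer<double> > q(backing);

    q.push(3.14);
    q.push(2.71);
    q.pop();

    // Caveat: circular_buffer::push_back overwrites the oldest element once the
    // buffer is full, so the caller must ensure q.size() never exceeds the capacity.
    return 0;
}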

Steve Townsend
  • No, it is not possible. I am building my test application in release mode (with /O2 optimization) and linking to the release runtime library. In debug mode, the memory usage is much higher: 4,327,684K for the 100M doubles case. – Tabber33 Nov 03 '10 at 16:10
  • @Tabber33 - see edit, run your measurements outside the IDE to avoid inefficient heap options – Steve Townsend Nov 03 '10 at 16:15
  • Just ran outside the IDE -- memory usage is identical. – Tabber33 Nov 03 '10 at 16:18
  • @Tabber33 - how strange. FWIW I get 754 MB with 25M `double`s. This is Private Bytes from Process Explorer. – Steve Townsend Nov 03 '10 at 16:19
  • @Steve, that's even worse. What is going on here? – Tabber33 Nov 03 '10 at 16:26
  • @Tabber33 - see latest edit. Effect of Heap Manager cannot be discounted, either. UMDH shows your allocs by callstack and should be easy to use here to get more trustworthy attribution for them – Steve Townsend Nov 03 '10 at 16:31
  • @Steve, what do you mean by "attribution?" – Tabber33 Nov 03 '10 at 16:37
  • @Tabber33 - "What memory does your code own, and what callstack caused it to be allocated?" Remember total process size (depending on how you view it) may include stuff that you have freed, but the heap manager did not yet return to the OS. – Steve Townsend Nov 03 '10 at 16:39
  • @Steve, I understand your point, but I'm willing to rule that out since I'm not freeing anything in my 3-line test (and I presume the deque is not freeing anything internally either). Also in my actual sort application, I see the memory usage sustained consistently at 2.5X what I expect, until the process completes, even though it must grow the swap file to handle this. – Tabber33 Nov 03 '10 at 16:46
  • 3
    You can also run with the environment variable `_NO_DEBUG_HEAP=1` to disable the debug heap even in the IDE. – MSN Nov 03 '10 at 16:50
  • @Tabber33 - see latest edit - drastically reduced overhead by changing `deque` blocksize. – Steve Townsend Nov 03 '10 at 17:05
  • @Steve, thanks, that makes sense, but I'm not willing to mess with `<deque>`. I'm writing portable code here for one thing, and since the algorithm depends on intense control over memory, I won't be using STL containers -- I guess I'll have to roll my own. – Tabber33 Nov 03 '10 at 17:29
  • @Tabber33 - agreed, this seems impractical if you use small elements in large numbers. best of luck. – Steve Townsend Nov 03 '10 at 17:33
  • @Tabber33 - fyi `Private Bytes` with 100000000 elements in `boost::circular_buffer` = 98056K – Steve Townsend Nov 04 '10 at 19:53
  • Hmm, where's it hiding the other 1,944,000 elements? – Tabber33 Nov 04 '10 at 20:40
  • @Tabber33 - 98056K = 98056 * 1024 – Steve Townsend Nov 04 '10 at 20:41
-2

Without looking at the actual implementation of std::queue you're using, my guess is that its memory allocation looks something like this:

if (new element won't fit) {
    double the size of the backing storage
    realloc the buffer (which will probably copy all elements)
}

The reason for doubling rather than being more conservative is that you want the queue.push_back operation to have O(1) average time. Since the reallocation may copy the existing elements, a version that only grew the array as needed (1 element at a time) would be O(n^2) as you initially push all of your values into the queue. I'll leave it as an exercise for the reader to show how the doubling version gives constant average time.
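(For completeness: with doubling, reallocations happen when the size reaches 1, 2, 4, ..., so pushing n elements copies at most 1 + 2 + 4 + ... + n < 2n elements in total, which works out to a constant number of copies per push on average.)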

Since you are quoting the size of the entire process, your estimate of about 2x overhead when you push slightly more than a power of 2 (2^26 < 100M < 2^27) worth of elements seems reasonable. Try stopping at 2^(n-1), measuring, then pushing a few more elements and measuring again.

Ben Jackson
  • 6
    `queue` is an adapter for `deque`, by default. `deque` does not work like this - it allocates memory in chunks and keeps a chain of them to represent the overall container. `vector` is more likely to do what you described here. – Steve Townsend Nov 03 '10 at 16:48
  • 2
    What you describe is somewhat accurate for `std::vector` when `reserve` is not called, but not at all accurate for `std::deque`. – Tabber33 Nov 03 '10 at 16:50