I was working on a project that needed fast string search over a large collection of string values. I decided to use a trie (specifically, a ternary search tree) for the search, and this approach was fast. This is the part of that project that's relevant to my question:

// Node of a ternary search tree: left/right are siblings ordered by
// character, mid descends to the next character of the key.
class TTrieNode{
public:
    char c;                         // character stored at this node
    bool data;                      // true if a key ends at this node
    TTrieNode *left, *mid, *right;
    TTrieNode(){
        left = mid = right = NULL;
        c = 0;
        data = false;
    }
};

class TTrie{
private:
    TTrieNode *root;
    TTrieNode *insert(TTrieNode *n, const char *s, int idx){
        char ch = s[idx];
        if(!n){
            n = new TTrieNode();
            n->c = ch;
        }
        if(ch < n->c){
            n->left = insert(n->left, s, idx);    // same position, smaller character
        }else if(ch > n->c){
            n->right = insert(n->right, s, idx);  // same position, larger character
        }else if(s[idx + 1] != '\0'){             // cheaper than idx+1 < strlen(s), which rescans the string
            n->mid = insert(n->mid, s, idx + 1);  // match: advance to the next character
        }else{
            n->data = true;                       // matched the last character: mark end of key
        }
        return n;
    }
public:
    TTrie() {
        root = NULL;
    }
    void insert(const char *s) {
        root = insert(root, s, 0);
    }
};
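
Lookup mirrors the insert logic over the same layout; simplified for illustration, it would be a sketch like this:

// Lookup sketch (illustrative, not part of the class as posted):
// walk left/right on character mismatch, descend on match.
bool contains(TTrieNode *n, const char *s, int idx){
    if(!n) return false;
    char ch = s[idx];
    if(ch < n->c) return contains(n->left, s, idx);
    if(ch > n->c) return contains(n->right, s, idx);
    if(s[idx + 1] == '\0') return n->data;  // end of query: is a key stored here?
    return contains(n->mid, s, idx + 1);
}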

Everything was good until we tested my trie on the real data. Based on my calculation of the number of nodes and the amount of space each node takes, it should have taken ~40 GB of RAM, but to my surprise it took ~70 GB. At first I thought this was because of memory alignment of each node (just a raw guess!), so I used __attribute__((packed, aligned(1))) on my TTrieNode definition. That didn't make much of a difference. After a lot of tests I switched to manual memory allocation: instead of calling new each time I want to allocate a new node, I allocate ~50 GB of RAM at the beginning of my program and use the following custom new function instead:

#include <cstdlib>  // malloc

// MAXNODES is an upper bound on the node count, defined elsewhere.
TTrieNode *memLoc;
size_t memIdx;  // size_t rather than int: the node count runs into the billions

void initMemory(){
    // One big arena up front: a single allocation instead of one per node.
    memLoc = (TTrieNode*) malloc(MAXNODES * sizeof(TTrieNode));
    memIdx = 0;
}
TTrieNode *myNew(){
    // Bump allocator: hands out nodes sequentially; no per-node header,
    // and (by design) no way to free an individual node.
    memLoc[memIdx].left = memLoc[memIdx].right = memLoc[memIdx].mid = NULL;
    memLoc[memIdx].c = 0;
    memLoc[memIdx].data = false;
    return &memLoc[memIdx++];
}

This was very surprising, but this time the program took EXACTLY the amount of memory I was expecting!

Now my questions are these:

Why is there extra memory overhead for each new (malloc)? Is there some kind of per-allocation bookkeeping pointer at the kernel/user level? I haven't tested my code on Windows (or any other operating system), but I'd like to know whether there is similar behavior on those operating systems as well.

  • [Do not cast the return value of `malloc()` in C](http://stackoverflow.com/a/605858/1983495), and please choose a language. And if you want to cast `malloc()` in C++, do it with the right construct: `static_cast<TTrieNode*>(malloc(...))`. – Iharob Al Asimi Sep 24 '15 at 10:55
  • That part with `malloc` is my solution and doesn't have any problems; I used `new` in my code (the part that took extra space) without any casts. – AKJ88 Sep 24 '15 at 10:58
  • You *do* realize that all pointers `malloc()` returns are aligned to 16 bytes? Also, you *do* realize that memory management has overhead? – EOF Sep 24 '15 at 11:00
  • Congratulations, you've encountered a case where the standard memory allocator performs poorly due to memory it reserves for metadata or fragmentation prevention. That overhead is acceptable for most applications, but it is quite a lot when working with many small objects (such as yours). – StoryTeller - Unslander Monica Sep 24 '15 at 11:01
  • @EOF I do realize what you said; based on my calculations I reached the conclusion that there is a 16-byte overhead per object. I just want to know exactly why that is and what metadata is stored, as `StoryTeller` mentioned. – AKJ88 Sep 24 '15 at 11:04
  • For each allocated object, `malloc()` allocates a couple of words of meta-data so it knows things like how long the memory area is, how it was allocated and other housekeeping data. Try to call `malloc()` less often to fix this (seems like you already figured that out). – fuz Sep 24 '15 at 11:05
  • Seconding @StoryTeller. Each chunk malloc'ed (or new'd) must be accompanied by some bookkeeping information, like a size, or a pointer to the next chunk, or whatever. I guess on a 64-bit system that can easily be 64 bits, aka 8 bytes. If you have short words (like natural-language words), that bookkeeping will of course be significant. (Your own allocating algorithm didn't need that because it's not truly dynamic; it just grows.) – Peter - Reinstate Monica Sep 24 '15 at 11:06
  • When you malloc a "small" object that exceeds the in-process free pool, malloc asks the OS for a much larger amount of memory. That excess amount grows with the total of previous allocations. At first glance I assumed AKJ88 had taken the 16-byte overhead into account and was asking about that excess allocation; I saw the comment contradicting that before posting my answer. The question was apparently mainly about the 16-byte overhead, but the excess allocation effect is also in play after you take the overhead into account. – JSF Sep 24 '15 at 14:59

1 Answer

There is an 8-to-16-byte overhead per chunk allocated. In a typical x86_64 allocator, there is an 8-byte overhead needed in order to correctly organize the memory chunks when they get freed. There is also a 16-byte alignment requirement, so a request that is already a multiple of 16 bytes picks up the basic 8-byte overhead and then must waste another 8 bytes to get back to alignment.
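
To make that concrete for this node type: on a typical LP64 system, sizeof(TTrieNode) is 32 bytes (three 8-byte pointers, a char, a bool, and padding), so each new costs a 48-byte chunk (32 + 8 rounded up to a multiple of 16), i.e. 1.5x the payload. That alone turns an expected ~40 GB into ~60 GB; heap fragmentation and growth can account for the rest. A small check, assuming glibc on x86_64 (malloc_usable_size is a glibc extension):

#include <cstdio>
#include <cstdlib>
#include <malloc.h>  // malloc_usable_size() -- glibc extension

int main(){
    // Request the LP64 size of a TTrieNode: 32 bytes.
    void *p = malloc(32);
    // glibc: 32 (request) + 8 (control word), rounded up to a 48-byte
    // chunk; the usable size reported is therefore 48 - 8 = 40.
    printf("requested 32, usable %zu\n", malloc_usable_size(p));
    free(p);
    return 0;
}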

Typical 64-bit design: each chunk is preceded by an 8-byte control word. Most of the control word is needed to record the size of that chunk, so it can be freed. The bottom few bits are available for other purposes because the size is divisible by 16. The most important of those purposes is knowing whether the preceding chunk is free: when this chunk is freed, if the preceding one was already free, the two get consolidated. The freed chunk also gets consolidated, if possible, with the next chunk, but being able to do that doesn't take extra information.

The common minimal information is surprising (and elegant), especially the fact that each chunk header must include a bit saying whether the *previous* chunk is free, but doesn't need a bit saying whether the *current* chunk is free. For consolidation you can always find the next chunk, since you know the size of the current one. But with minimal information you can't find the previous chunk unless you already know it is free, and you don't need to find it unless it is free. So at the end of each free chunk there is a pointer (or equivalently, a size) back to its beginning; if the previous chunk is free, you can navigate to it from its successor. If it is not free, those bytes are part of the used data, not overhead. You could even find out whether the successor is free by going to the successor's successor and checking whether its predecessor is free; that is more elegant than using one more of the spare bits, but not necessarily better.
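
In code, the layout described above looks roughly like this (a sketch modeled on the classic dlmalloc/glibc design; the field and flag names are illustrative, not the exact glibc identifiers):

#include <cstddef>

// Sketch of a dlmalloc/glibc-style chunk header (names illustrative).
struct ChunkHeader {
    std::size_t prev_size; // size of the previous chunk, valid ONLY when
                           // the previous chunk is free; otherwise these
                           // bytes hold the tail of its user data
    std::size_t size;      // this chunk's size; sizes are multiples of 16,
                           // so the low bits are free to be used as flags
};

const std::size_t PREV_INUSE = 0x1; // low bit of size: preceding chunk in use

// The next chunk starts right after this one, so forward consolidation
// only needs size; backward consolidation needs prev_size, which is
// only readable when the previous chunk is free.
inline ChunkHeader *next_chunk(ChunkHeader *c){
    return reinterpret_cast<ChunkHeader *>(
        reinterpret_cast<char *>(c) + (c->size & ~static_cast<std::size_t>(0xF)));
}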

  • Thanks. Where can I see these details in the Linux documentation? How can I find out how they work? – AKJ88 Sep 24 '15 at 11:36
  • I don't know where that would be documented, and it would be GNU rather than Linux, because that part of the GNU code is not specific to Linux. About 10 years ago, I read an online paper explaining why a certain malloc algorithm was better than the then-current GNU one. That paper included a description of the basics (what I described above) common to both designs. But I don't have a URL or other source for similar documentation. – JSF Sep 24 '15 at 11:54
  • The master "documentation" for anything GNU is the source code itself. When I read the source code of malloc, already knowing most of the details of how it worked, I still found it very hard to even fit what I knew it was doing to what was in the source code. It would be close to winning an obfuscated code contest. I managed to sync up what I knew it did with the code well enough to find the detail I was looking for. But it is not an easy read. Despite how difficult it is, it is the only definitive source of how it works. – JSF Sep 24 '15 at 11:58