
I've declared and defined the following HashTable class. Note that I needed a hash table of hash tables, so my HashEntry struct contains a HashTable pointer. The public part is nothing special; it has the traditional hash table functions, so I removed them for simplicity.

#include <string>
#include <vector>

enum Status{ACTIVE, DELETED, EMPTY};
enum Type{DNS_ENTRY, URL_ENTRY};

class HashTable{
private:
    struct HashEntry{
        std::string key;
        Status current_status;
        std::string ip;
        int access_count;
        Type entry_type;
        HashTable *table;

        HashEntry(
                const std::string &k = std::string(),
                Status s = EMPTY,
                const std::string &u = std::string(),
                int a = 0,
                Type e = DNS_ENTRY,
                HashTable *t = NULL
                ): key(k), current_status(s), ip(u), access_count(a), entry_type(e), table(t){}
    };

    std::vector<HashEntry> array;
    int currentSize;
public:
    HashTable(int size = 1181, int csz = 0): array(size), currentSize(csz){}
};

I am using quadratic probing, and in my rehash function I roughly double the size of the vector once currentSize exceeds array.size()/2. The following list is used when a larger table size is needed.

int a[16] = {49663, 99907, 181031, 360461, ...};

My problem is that this class consumes far too much memory. I've just profiled it with massif and found out that it needs 33MB (33 million bytes!) of memory for 125000 insertions. To be clear:

1 insertion -> 47352 bytes
8 insertions -> 48376 bytes
512 insertions -> 76.27KB
1000 insertions -> 2MB (array size increased to 49663 here)
27000 insertions -> 8MB (array size increased to 99907 here)
64000 insertions -> 16MB (array size increased to 181031 here)
125000 insertions -> 33MB (array size increased to 360461 here)

This may be more detail than necessary, but I wanted to show how memory usage changes with the input. As you can see, memory usage doubles whenever rehashing is done: our initial array size was only 1181, yet 125000 elements end up costing 33MB.

To debug the problem, I changed the initial size to 360461, so that 127000 insertions need no rehashing. With that initial value, 20MB of memory is used. That is still huge, but I think it suggests there is a problem with rehashing. The following is my rehash function.

void HashTable::rehash(){
    std::vector<HashEntry> oldArray = array;   // first copy of the whole table

    array.resize(nextprime(array.size()));     // grow to the next listed prime
    for(int j = 0; j < array.size(); j++){
        array[j].current_status = EMPTY;       // mark every slot empty
    }
    for(int i = 0; i < oldArray.size(); i++){
        if(oldArray[i].current_status == ACTIVE){
            insert(oldArray[i].key);           // re-insert the key...
            int pos = findPos(oldArray[i].key);
            array[pos] = oldArray[i];          // ...then overwrite with the full entry
        }
    }
}
int nextprime(int arraysize){
    int a[] = {49663, 99907, 181031, 360461, 720703, 1400863, 2800519, 5600533, 11200031, 22000787, 44000027};
    int i = 0;
    while(arraysize >= a[i]){i++;}  // assumes arraysize < 44000027
    return a[i];
}

This is the insert function used in rehashing and everywhere else.

bool HashTable::insert(const std::string &k){
    int currentPos = findPos(k);
    if(isActive(currentPos)){
        return false;                          // key already present
    }
    array[currentPos] = HashEntry(k, ACTIVE);
    if(++currentSize > array.size() / 2){      // load factor exceeds 1/2
        rehash();
    }
    return true;
}

What am I doing wrong here? Even if the spikes are caused by rehashing, 20MB is still used when no rehashing is done, and I believe 20MB is way too much for ~127k items. This hash table is supposed to hold around 8 million elements.

user2694307
  • Is there a reason for storing an entire table per entry? It might help if you could post the code for assigning `HashTable` to `HashEntry`. – Jason Jan 02 '16 at 00:01
  • @Jason Every entry of the hash table can itself contain a hash table in its entries; I couldn't think of anything other than this self-referential definition. And of course thank you for your help, but I didn't understand what you meant by assigning HashTable to HashEntry. Those are different classes; can they be assigned to each other? – user2694307 Jan 02 '16 at 00:08
  • @Jason Also, I didn't have any nested hashtables during these profilings. It was just one main hash table that had NULL for the HashTables in its HashEntries. – user2694307 Jan 02 '16 at 00:11
  • @A.S.H During these profilings there was no nested hash table, so I didn't even touch them; I just called the insert function and got these memory usages. – user2694307 Jan 02 '16 at 00:12
  • To know what memory size is occupied by each "non-initialized" entry (empty strings and no nested tables), you can test it with `cout << sizeof(HashEntry)`. On a 64-bit Windows system it is 104 bytes. The rest depends on what you are doing in your test, what each `std::string` will allocate as buffer size, etc. – A.S.H Jan 02 '16 at 00:19
  • How are you measuring memory usage? – Mats Petersson Jan 02 '16 at 00:21
  • You should template-ize this, to solve your difficulty with the nested hash table, and there is no reason for the entry class constructor to have defaults for its parameters, especially the key. – user207421 Jan 02 '16 at 00:23
  • @MatsPetersson I've used valgrind with --tool=massif. But it was seen even from windows task manager. – user2694307 Jan 02 '16 at 00:24
  • @EJP, can the circular dependency be overcome with templates? – user2694307 Jan 02 '16 at 00:27
  • Is there a reason you're not using smart pointers? FYI, you can store a network address using only four or eight bytes (`uint32_t`, `uint64_t`). – Jason Jan 02 '16 at 00:34
  • @Jason we are not allowed to include any library but vector and string. But I'd like to hear your solution. The problem is to write a DNS manager, so I thought the DNS values could go into a hash table, and every DNS value can contain a URL list. We should register DNSs and URLs and access them very fast. That's why I came up with nested hashtables. – user2694307 Jan 02 '16 at 00:38
  • 1
    I guess "we are not allowed to" explains why you're not using `std::unordered_map`, which would be the obvious solution. – rici Jan 02 '16 at 00:45

3 Answers


The fact that 360,461 HashEntry objects take 20 MB is hardly surprising. Did you try looking at sizeof(HashEntry)?

Each HashEntry includes two std::strings, a pointer, and three ints. As the old joke has it, it's not easy to answer the question "How long is a string?", in this case because there is a wide variety of string implementations and optimizations, so you might find that sizeof(std::string) is anywhere between 4 and 32 bytes. (It would only be 4 bytes on a 32-bit architecture.) In practice, a string requires three pointers and the string itself unless it happens to be empty. If sizeof(std::string) is the same as sizeof(void*), then you've probably got a not-too-recent GNU standard library, in which the std::string is an opaque pointer to a block containing two pointers, a reference count, and the string itself. If sizeof(std::string) is 32 bytes, then you might have a recent GNU standard library implementation in which there is a bit of extra space in the string structure for the short-string optimization. See the answer to Why does libc++'s implementation of std::string take up 3x memory as libstdc++? for some measurements. Let's just say 32 bytes per string, and ignore the details; it won't be off by much.

So two strings (32 bytes each) plus a pointer (8 bytes) plus three ints (another 12 bytes) and four bytes of padding because one of the ints is between two 8-byte aligned objects, and that's a total of 88 bytes per HashEntry. And if you have 360,461 hash entries, that would be 31,720,568 bytes, about 30 MB. The fact that you're "only" using 20MB is probably because you're using the old GNU standard library, which optimizes empty strings to a single pointer, and the majority of your strings are empty strings because half the slots have never been used.
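
You can verify this arithmetic yourself. Here is a minimal sketch (the exact numbers depend on your compiler, standard library, and ABI; the void* stands in for the HashTable*, since all object pointers have the same size):

#include <iostream>
#include <string>

enum Status{ACTIVE, DELETED, EMPTY};
enum Type{DNS_ENTRY, URL_ENTRY};

// Mirrors the layout of the question's HashEntry.
struct HashEntry {
    std::string key;
    Status current_status;
    std::string ip;
    int access_count;
    Type entry_type;
    void *table;   // stand-in for HashTable*
};

int main() {
    std::cout << "sizeof(std::string) = " << sizeof(std::string) << '\n';
    std::cout << "sizeof(HashEntry)   = " << sizeof(HashEntry) << '\n';
    // Footprint of the slot array alone, before any heap-allocated buffers:
    std::cout << "360461 slots        = " << 360461 * sizeof(HashEntry) << " bytes\n";
}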

Now, let's take a look at rehash. Reduced to its essentials (pseudocode):

void rehash() {
  std::vector<HashEntry> tmp = array;  /* Copy the entire array */
  array.resize(new_size());            /* Internally does another copy */
  for (auto const& entry : tmp)
    if (entry.used()) array.insert(entry);  /* Yet another copy */
}

At peak, we had two copies of the smaller array as well as the new big array. Even if the new array is only 20 MB, it's not surprising that peak memory usage was almost twice that. (Indeed, this is again surprisingly small, not surprisingly big. Possibly it was not actually necessary to change the address of the new vector because it was at the end of the current allocated memory space, which could just be extended.)

Note that we did two copies of all that data, and array.resize() potentially did another one. Let's see if we can do better:

void rehash() {
  std::vector<HashEntry> tmp(new_size());  /* Make an array of default objects */
  for (auto const& entry: array) 
    if (entry.used()) tmp.insert(entry);   /* Copy into the new array */
  std::swap(tmp, array);                   /* Not a copy, just swap three pointers */
}

This way, we only do one copy. Instead of a (possible) internal copy by resize, we do a bulk construction of the new elements, which should be similar. (It's just zeroing out the memory.)

Also, in the new version we only copy the actual strings once each, instead of twice each, which is the fiddliest part of the copy and thus probably quite a large saving.
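
For concreteness, here is how that swap-based rehash might look when adapted to the class from the question. This is only a sketch: it assumes findPos probes the member array, as in the question, and relies on default-constructed HashEntry objects being EMPTY (which the question's constructor guarantees):

void HashTable::rehash(){
    // Build the larger table up front; default-constructed
    // HashEntry objects already have current_status == EMPTY.
    std::vector<HashEntry> tmp(nextprime(array.size()));
    tmp.swap(array);     // 'array' is now the new, empty table
    currentSize = 0;     // recount, since DELETED entries are dropped
    for(size_t i = 0; i < tmp.size(); i++){
        if(tmp[i].current_status == ACTIVE){
            int pos = findPos(tmp[i].key);  // probe the new array
            array[pos] = tmp[i];            // the one and only copy
            ++currentSize;
        }
    }
}   // tmp (the old table) is destroyed here

Note that this version never calls insert, so it cannot recursively trigger another rehash.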

Proper string management could reduce that overhead further. rehash doesn't actually need to copy the strings, since they are not changed. So we could keep the strings elsewhere, say in a vector of strings, and just use the index into the vector in the HashEntry. Since you are not expecting to hold billions of strings, only millions, the index could be a four-byte int. By also shuffling the HashEntry fields around and reducing the enums to a byte instead of four bytes (in C++11, you can specify the underlying integer type of an enum), the HashEntry could be reduced to 24 bytes, and there wouldn't be a need to leave space for as many string descriptors.
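
A sketch of what that compact layout might look like (hypothetical names; assumes C++11 fixed underlying enum types and a typical 64-bit ABI):

#include <cstdint>
#include <string>
#include <vector>

class HashTable;  // forward declaration, as in the question

enum Status : std::uint8_t {ACTIVE, DELETED, EMPTY};
enum Type   : std::uint8_t {DNS_ENTRY, URL_ENTRY};

std::vector<std::string> string_pool;  // owns the actual characters

struct CompactHashEntry {
    HashTable    *table;          //  8 bytes
    std::uint32_t key_index;      //  4 bytes: index into string_pool
    std::uint32_t ip_index;       //  4 bytes: index into string_pool
    std::int32_t  access_count;   //  4 bytes
    Status current_status;        //  1 byte
    Type   entry_type;            //  1 byte
};  // 22 bytes of fields + 2 bytes of padding = 24 bytes total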

rici

Since you are using open addressing and rehash at a load factor of 1/2, at least half your hash slots are always empty. Since HashEntry is quite large, storing a full HashEntry in each empty slot is terribly wasteful.

You should store your HashEntry structs somewhere else and put HashEntry* in your hash table, or switch to chaining with a much denser load factor. Either one will reduce this waste.
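
For example, the table-of-pointers variant might look like this. It is a sketch that keeps the question's constraint of using only vector and string, so it uses raw pointers, which means the destructor must delete every non-null slot:

class HashTable{
private:
    struct HashEntry { /* ... as in the question ... */ };

    // An empty slot now costs one pointer (8 bytes on 64-bit)
    // instead of a full HashEntry (~88 bytes with two std::strings).
    // A null pointer plays the role of an EMPTY slot.
    std::vector<HashEntry*> array;
    int currentSize;
    // ...
};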

Also, if you're going to move HashEntry objects around, swap instead of copying, or use move semantics so you don't have to copy so many strings. Be sure to clear out the strings in any entries you're no longer using.
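
Concretely, the copy in the rehash loop might become one of these (sketches; std::move requires C++11, while std::string::swap works in C++03 too):

// C++11: steal the old entry's string buffers instead of copying them.
array[pos] = std::move(oldArray[i]);

// C++03 alternative: swap the string members individually;
// std::string::swap exchanges buffers without copying characters.
array[pos].key.swap(oldArray[i].key);
array[pos].ip.swap(oldArray[i].ip);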

Also, even though you say you need HashTables of HashTables, you don't really explain why. Since many small hash tables are not memory-efficient, it's usually better to use a single hash table with efficiently represented compound keys.
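
For instance, a single table keyed on the (DNS, URL) pair could build a compound key with a helper like this (hypothetical; it assumes '\n' can never appear in either part):

#include <string>

// One table, one lookup, no nested tables.
std::string make_key(const std::string &dns, const std::string &url){
    return dns + '\n' + url;  // separator assumed absent from both parts
}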

Matt Timmermans
  • Yes a HashEntry pointer would reduce the memory, I will try that. – user2694307 Jan 02 '16 at 09:17
  • Also, the problem is to write a DNS manager, so I thought the DNS values could go into a hash table, and every DNS value can contain a URL list. We should register DNSs and URLs and access them very fast. That's why I came up with nested hashtables. But I'd like to hear a better solution, because mine is probably not sufficient even in terms of time. – user2694307 Jan 02 '16 at 09:58

I have changed my structure a little, just as you all suggested, but there is one thing that nobody noticed.

When rehashing/resizing is done, my rehash function calls insert. In that insert function I increment currentSize, which holds how many elements the hash table has. So each time a resize happened, currentSize doubled when it should have stayed the same. I removed that line, wrote the proper rehashing code, and now I think I'm okay.

I am using two different structs now, and the program consumes 1.6GB of memory for 8 million elements, which is about what I expected given the multi-byte strings and integers. Before, that number was more like 7-8GB.

user2694307