
We are developing highly performance-critical software in C++, where we need a concurrent hash map, so we implemented one. We then wrote a benchmark to figure out how much slower our concurrent hash map is compared with std::unordered_map.

But std::unordered_map seems to be incredibly slow... This is our micro-benchmark (for the concurrent map we spawn a new thread to make sure that the locking does not get optimized away; also note that I never insert 0, because I also benchmark with google::dense_hash_map, which needs a null value):

boost::random::mt19937 rng;
boost::random::uniform_int_distribution<> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
std::vector<uint64_t> vec(SIZE);
for (int i = 0; i < SIZE; ++i) {
    uint64_t val = 0;
    while (val == 0) {
        val = dist(rng);
    }
    vec[i] = val;
}
std::unordered_map<int, long double> map;
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < SIZE; ++i) {
    map[vec[i]] = 0.0;
}
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "inserts: " << elapsed.count() << std::endl;
std::random_shuffle(vec.begin(), vec.end());
begin = std::chrono::high_resolution_clock::now();
long double val;
for (int i = 0; i < SIZE; ++i) {
    val = map[vec[i]];
}
end = std::chrono::high_resolution_clock::now();
elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "get: " << elapsed.count() << std::endl;

(EDIT: the whole source code can be found here: http://pastebin.com/vPqf7eya)

The result for std::unordered_map is:

inserts: 35126
get    : 2959

For google::dense_hash_map:

inserts: 3653
get    : 816

For our hand-rolled concurrent map (which does locking, although the benchmark is single-threaded - but it runs in a separately spawned thread):

inserts: 5213
get    : 2594

If I compile the benchmark program without pthread support and run everything in the main thread, I get the following results for our hand-rolled concurrent map:

inserts: 4441
get    : 1180

I compile with the following command:

g++-4.7 -O3 -DNDEBUG -I/tmp/benchmap/sparsehash-2.0.2/src/ -std=c++11 -pthread main.cc

So inserts into std::unordered_map in particular seem to be extremely expensive - 35 seconds vs. 3-5 seconds for the other maps. The lookup time also seems to be quite high.

My question: why is this? I read another question on Stack Overflow where someone asks why std::tr1::unordered_map is slower than his own implementation. There the highest-rated answer states that std::tr1::unordered_map has to implement a more complicated interface. But I don't see this argument: we use a bucket approach in our concurrent_map, and std::unordered_map uses a bucket approach too (google::dense_hash_map does not, but then std::unordered_map should at least be as fast as our hand-rolled, concurrency-safe version?). Apart from that, I cannot see anything in the interface that forces a feature which would make the hash map perform badly...

So my question: is it true that std::unordered_map is simply very slow? If no: what am I doing wrong? If yes: what is the reason for that?

And my main question: why is inserting a value into a std::unordered_map so terribly expensive (even if we reserve enough space at the beginning, it does not perform much better - so rehashing does not seem to be the problem)?
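To clarify what I mean by reserving enough space - a minimal sketch of the pre-sized variant of the insert loop (not the exact code we used; SIZE and vec are the same as in the snippet above):

std::unordered_map<int, long double> map;
// reserve() picks a bucket count large enough for at least SIZE elements
// without exceeding max_load_factor(), so no rehash happens during the loop.
map.reserve(SIZE);
for (int i = 0; i < SIZE; ++i) {
    map[vec[i]] = 0.0;
}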

EDIT:

First of all: yes, the presented benchmark is not flawless - that is because we played around with it a lot and it is just a hack (for example, the uint64 distribution to generate ints would not be a good idea in practice, and excluding 0 in a loop is kind of stupid, etc.).

At the moment most comments explain that I can make the unordered_map faster by preallocating enough space for it. In our application this is simply not possible: we are developing a database management system and need a hash map to store some data during a transaction (for example locking information). This map can hold anything from 1 entry (the user just makes one insert and commits) to billions of entries (if full table scans happen). It is just impossible to preallocate enough space here (and simply allocating a lot at the beginning would consume too much memory).

Furthermore, I apologize that I did not state my question clearly enough: I am not really interested in making unordered_map fast (using Google's dense hash map works fine for us), I just don't understand where this huge performance difference comes from. It cannot be just preallocation (even with enough preallocated memory, the dense map is an order of magnitude faster than unordered_map, and our hand-rolled concurrent map starts with an array of size 64 - so a smaller one than unordered_map).

So what is the reason for the bad performance of std::unordered_map? Or, asked differently: could one write an implementation of the std::unordered_map interface that is standard-conformant and (nearly) as fast as Google's dense hash map? Or is there something in the standard that forces the implementer to choose an inefficient way to implement it?

EDIT 2:

By profiling I see that a lot of time is spent on integer divisions. std::unordered_map uses prime numbers for the array size, while the other implementations use powers of two. Why does std::unordered_map use prime numbers? To perform better if the hash is bad? For good hashes it makes, IMHO, no difference.
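To illustrate what I mean (a sketch of the two indexing strategies, not the actual library code): with a prime-sized table the bucket index needs an integer modulo, while a power-of-two-sized table only needs a bitwise AND:

#include <cstddef>

// Prime-sized table: the modulo compiles to an integer division,
// one of the slowest ALU operations.
std::size_t bucket_prime(std::size_t hash, std::size_t prime_bucket_count) {
    return hash % prime_bucket_count;
}

// Power-of-two-sized table: a single cheap AND
// (only valid if pow2_bucket_count is a power of two).
std::size_t bucket_pow2(std::size_t hash, std::size_t pow2_bucket_count) {
    return hash & (pow2_bucket_count - 1);
}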

EDIT 3:

These are the numbers for std::map:

inserts: 16462
get    : 16978

Sooooooo: why are inserts into a std::map faster than inserts into a std::unordered_map... I mean WAT? std::map has worse locality (tree vs. array), needs to make more allocations (one per insert vs. one per rehash plus ~1 for each collision) and, most importantly, has a different algorithmic complexity (O(log n) vs. O(1))!

  • Most of the containers in std are VERY conservative with their estimates. I'd have a look at the bucket count you're using (specified in the constructor) and increase it to a better estimate for your `SIZE`. – Ylisar Jul 23 '12 at 14:16
  • Have you tried concurrent_hash_map from the Intel TBB? http://threadingbuildingblocks.org/docs/help/reference/containers_overview/concurrent_hash_map_cls.htm – MadScientist Jul 23 '12 at 14:18
  • 1. What is SIZE? 2. Why are you specifying a range in terms of uint64_t for a uniform_int_distribution of int? This is resulting in a distribution from 0 to -1. What is that doing? 3. If you don't want 0, why don't you just specify that in your distribution? – Howard Hinnant Jul 23 '12 at 14:51
  • @MadScientist We considered TBB. The problem is the licensing: it is a research project and we are not yet sure how we will publish it (most definitely open source - but if we want to allow usage in a commercial product, GPLv2 is too restrictive). It is also another dependency. Maybe we will use it at a later point in time; so far we can live without it just fine. – Markus Pilman Jul 24 '12 at 08:05
  • Running it under a profiler, e.g. valgrind, can be insightful. – Maxim Egorushkin Jul 24 '12 at 08:15
  • Would be helpful if you posted a complete source so that we can compile it and see what you see. – Maxim Egorushkin Jul 24 '12 at 08:20
  • @MaximYegorushkin Edited the post and added a pastebin link to the full source. My profiler tells me that a lot of calculation time is consumed by integer division. The other maps we tested use powers of two, while std::unordered_map uses prime numbers... That could be a reason. Why do they use prime numbers? – Markus Pilman Jul 24 '12 at 09:14
  • Locality in a hash table is at best slightly better than locality in a tree, at least if the hash function is "random". That hash function ensures you rarely access nearby items at nearby times. The only advantage you have is that the hashtable array is one contiguous block. That can be true for a tree anyway, if the heap isn't fragmented and you build the tree all at once. Once the size is larger than the cache, differences in locality will make little if any difference to performance. – Jul 25 '12 at 08:14
  • Take a look at aligned hashing in [ulib](http://code.google.com/p/ulib/); it is amazingly fast and easy to use. The current trunk also has a concurrent version of the hash map, as far as I know. – Sep 25 '12 at 06:02

3 Answers


I found the reason: it is a problem with gcc-4.7!

With gcc-4.7

inserts: 37728
get    : 2985

With gcc-4.6

inserts: 2531
get    : 1565

So std::unordered_map in gcc-4.7 is broken (or my installation is - one installation is gcc 4.7.0 on Ubuntu, and another is gcc 4.7.1 on Debian testing).

I will submit a bug report. Until then: DO NOT use std::unordered_map with gcc 4.7!

  • Is there anything in the delta from 4.6 that would cause that? – Mark Canlas Jul 24 '12 at 16:38
  • [There is already a report in the mailing list.](http://www.mail-archive.com/gcc-bugs@gcc.gnu.org/msg368041.html) The discussion seems to be pointing to "fixes" to `max_load_factor` handling, which led to the difference in performance. – jxh Jul 24 '12 at 18:21
  • Bad timing for this bug! I was getting very poor performance with unordered_map but I'm glad that it has been reported and "fixed". – Bo Lu Aug 02 '12 at 17:36
  • +1 - What a suck BBBBBUG.. I wonder what happens with gcc-4.8.2 – ikh Mar 25 '14 at 10:54
  • Any updates on this bug? Does it still exist for later versions of GCC (5+)? – rph Oct 13 '16 at 03:53
  • @rkioji One of the [changes in GCC 7](https://gcc.gnu.org/gcc-7/changes.html) is "A new power-of-two rehashing policy for use with the _Hashtable internals", so performance is probably different now. – Jeffrey Bosboom May 24 '17 at 20:57

I am guessing that you have not properly sized your unordered_map, as Ylisar suggested. When chains grow too long in unordered_map, the g++ implementation will automatically rehash to a larger hash table, and this would be a big drag on performance. If I remember correctly, unordered_map defaults to (smallest prime larger than) 100.
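(If you want to check what your implementation actually uses, you can simply print the default bucket count - a quick standalone sketch, not part of the benchmark:)

#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, long double> m;
    // bucket_count() reports the current (implementation-defined) number of
    // buckets; for a default-constructed map this is the default table size.
    std::cout << "default buckets: " << m.bucket_count() << '\n';
}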

I didn't have chrono on my system, so I timed with times().

#include <sys/times.h>  // times(), struct tms
#include <unistd.h>     // sysconf(), _SC_CLK_TCK
#include <iostream>

template <typename TEST>
void time_test (TEST t, const char *m) {
    struct tms start;
    struct tms finish;
    long ticks_per_second;

    times(&start);
    t();
    times(&finish);
    ticks_per_second = sysconf(_SC_CLK_TCK);
    std::cout << "elapsed: "
              << ((finish.tms_utime - start.tms_utime
                   + finish.tms_stime - start.tms_stime)
                  / (1.0 * ticks_per_second))
              << " " << m << std::endl;
}

I used a SIZE of 10000000, and had to change things a bit for my version of boost. Also note that I pre-sized the hash table to match SIZE/DEPTH, where DEPTH is an estimate of the length of the bucket chain due to hash collisions.

Edit: Howard points out to me in comments that the max load factor for unordered_map is 1. So, the DEPTH controls how many times the code will rehash.

#include <algorithm>        // std::random_shuffle
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>
#include <boost/random.hpp> // boost::mt19937, boost::uniform_int

#define SIZE 10000000
#define DEPTH 3
std::vector<uint64_t> vec(SIZE);
boost::mt19937 rng;
boost::uniform_int<uint64_t> dist(std::numeric_limits<uint64_t>::min(),
                                  std::numeric_limits<uint64_t>::max());
std::unordered_map<int, long double> map(SIZE/DEPTH);

void
test_insert () {
    for (int i = 0; i < SIZE; ++i) {
        map[vec[i]] = 0.0;
    }
}

void
test_get () {
    long double val;
    for (int i = 0; i < SIZE; ++i) {
        val = map[vec[i]];
    }
}

int main () {
    for (int i = 0; i < SIZE; ++i) {
        uint64_t val = 0;
        while (val == 0) {
            val = dist(rng);
        }
        vec[i] = val;
    }
    time_test(test_insert, "inserts");
    std::random_shuffle(vec.begin(), vec.end());
    time_test(test_insert, "get");
}

Edit:

I modified the code so that I could change out DEPTH more easily.

#ifndef DEPTH
#define DEPTH 10000000
#endif

So, by default, the worst size for the hash table is chosen.

elapsed: 7.12 inserts, elapsed: 2.32 get, -DDEPTH=10000000
elapsed: 6.99 inserts, elapsed: 2.58 get, -DDEPTH=1000000
elapsed: 8.94 inserts, elapsed: 2.18 get, -DDEPTH=100000
elapsed: 5.23 inserts, elapsed: 2.41 get, -DDEPTH=10000
elapsed: 5.35 inserts, elapsed: 2.55 get, -DDEPTH=1000
elapsed: 6.29 inserts, elapsed: 2.05 get, -DDEPTH=100
elapsed: 6.76 inserts, elapsed: 2.03 get, -DDEPTH=10
elapsed: 2.86 inserts, elapsed: 2.29 get, -DDEPTH=1

My conclusion is that there is not much significant performance difference for any initial hash table size other than making it equal to the entire expected number of unique insertions. Also, I don't see the order of magnitude performance difference that you are observing.

  • `std::unordered_map` has a default maximum load factor of 1. So except for the initial number of buckets, your DEPTH is ignored. If desired you can `map.max_load_factor(DEPTH)`. – Howard Hinnant Jul 23 '12 at 15:16
  • @HowardHinnant: Thanks for that info. So the `DEPTH` is ignored, but it still controls how often the map will get rehashed into a larger map. The answer has been updated, and thanks again – jxh Jul 23 '12 at 15:23
  • @user315052 Yes I know I can make it better by giving it a sane size at the beginning - but I cannot do that in our software (it's a research project - a DBMS - and I cannot know in advance how much I will insert - it can vary between 0 and 1 billion...). But even with preallocation it is slower than our map, and way slower than Google's dense_map - I am still wondering what it is that makes the big difference. – Markus Pilman Jul 24 '12 at 07:43
  • @MarkusPilman: I don't know how my results compare to yours, because you never provided how large a `SIZE` you were working with. I can say the `unordered_map` is twice as fast with `DEPTH` set to `1` and properly preallocated. – jxh Jul 24 '12 at 08:36
  • @user315052 oh yes sorry, the SIZE I have chosen is 1'000'000 - so the same as yours. I also got more than a factor of 2 speed-up for insertions - but it is still extremely slow compared with other implementations – Markus Pilman Jul 24 '12 at 09:00
  • sorry again... I meant 10'000'000 – Markus Pilman Jul 24 '12 at 09:38
  • @MarkusPilman: My times are already in seconds. I thought your times were in milliseconds. If the insertions with `DEPTH` set to `1` are taking less than `3` seconds, how is this an order of magnitude slower? – jxh Jul 24 '12 at 15:15
  • @user315052 Yes it is milliseconds. I have no idea - we have two different programs written by two different people - 10 million inserts take ~35 seconds. The difference for the get could be that my machine is quite old. But the one for the inserts is strange... We use gcc 4.7 (forgot to say). I'll try with gcc-4.6 – Markus Pilman Jul 24 '12 at 15:39

I have run your code on a 64-bit / AMD / 4-core (2.1 GHz) computer and it gave me the following results:

MinGW-W64 4.9.2:

Using std::unordered_map:

inserts: 9280 
get: 3302

Using std::map:

inserts: 23946
get: 24824

VC 2015 with all the optimization flags I know:

Using std::unordered_map:

inserts: 7289
get: 1908

Using std::map:

inserts: 19222 
get: 19711

I have not tested the code using GCC, but I think it may be comparable to the performance of VC, so if that is true, then GCC 4.9's std::unordered_map is still broken.

[EDIT]

So yes, as someone said in the comments, there is no reason to think that the performance of GCC 4.9.x would be comparable to VC performance. When I have the chance I will test the code with GCC.

My answer is just meant to establish some kind of knowledge base for the other answers.

  • "I have not tested the code using GCC but I think it may be comparable to the performance of VC." Totally unfounded claim, without any benchmarking comparable to the one found in the original post. This "answer" doesn't answer the question in any sense, let alone answering the "why" question. – 4ae1e1 Nov 16 '15 at 23:19
  • "I have not tested the code using GCC" ...how is it that you managed to acquire and use MinGW while knowing so little about it? MinGW is fundamentally a closely tracking port of GCC. – underscore_d Jun 26 '16 at 10:00