
Ran a simple performance test on hash maps; it appears the C++ version is slower than both the Perl version and the Go version.

  • Perl version took about 200 milliseconds.
  • C++ version took about 280 milliseconds.
  • Go version took about 56 milliseconds.

This was on my PC with a Core(TM) i7-2670QM CPU @ 2.20GHz, running Ubuntu 14.04.3 LTS.

Any ideas?

Perl version

use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep
                      clock_gettime clock_getres clock_nanosleep clock
                      stat );
sub getTS {
    my ($seconds, $microseconds) = gettimeofday;
    return $seconds + (0.0+ $microseconds)/1000000.0;
}
my %mymap;
$mymap{"U.S."} = "Washington";
$mymap{"U.K."} = "London";
$mymap{"France"} = "Paris";
$mymap{"Russia"} = "Moscow";
$mymap{"China"} = "Beijing";
$mymap{"Germany"} = "Berlin";
$mymap{"Japan"} = "Tokyo";
$mymap{"China"} = "Beijing";
$mymap{"Italy"} = "Rome";
$mymap{"Spain"} = "Madrad";
$x = "";
$start = getTS();
for ($i=0; $i<1000000; $i++) {
    $x = $mymap{"China"};
}
printf "took %f sec\n", getTS() - $start;

C++ version

#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}
using namespace std;
int main () {
  std::unordered_map<std::string,std::string> mymap;

  // populating container:
  mymap["U.S."] = "Washington";
  mymap["U.K."] = "London";
  mymap["France"] = "Paris";
  mymap["Russia"] = "Moscow";
  mymap["China"] = "Beijing";
  mymap["Germany"] = "Berlin";
  mymap["Japan"] = "Tokyo";
  mymap["China"] = "Beijing";
  mymap["Italy"] = "Rome";
  mymap["Spain"] = "Madrad";

  double start = getTS();
  string x;
  for (int i=0; i<1000000; i++) {
      mymap["China"];
  }
  printf("took %f sec\n", getTS() - start);
  return 0;
}

Go version

package main

import "fmt"
import "time"

func main() {
    var x string
    mymap := make(map[string]string)
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrad";
    t0 := time.Now()
    sum := 1
    for sum < 1000000 {
        x = mymap["China"]
        sum += 1
    }
    t1 := time.Now()
    fmt.Printf("The call took %v to run.\n", t1.Sub(t0))
    fmt.Println(x)
}

Update 1

To improve the C++ version, I changed x = mymap["China"]; to mymap["China"];, but it made very little difference in performance.

Update 2

I got the original result when compiling without any optimization: g++ -std=c++11 unorderedMap.cc. With "-O2" optimization, it takes only about half the time (150ms).

Update 3

To remove the possible char* to string constructor call, I created a string constant. The time comes down to about 220ms (compiled without optimization). Thanks to the suggestion from @neil-kirk: with optimization (-O2 flag), the time is about 80ms.

  double start = getTS();
  string x = "China";
  for (int i=0; i<1000000; i++) {
      mymap[x];
  }

Update 4

Thanks to @steffen-ullrich, who pointed out a syntax error in the Perl version. I changed it. The performance number is now about 150ms.

Update 5

It appears the number of executed instructions matters. Using the command valgrind --tool=cachegrind <cmd>:

For the Go version

$ valgrind --tool=cachegrind  ./te1
==2103== Cachegrind, a cache and branch-prediction profiler
==2103== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2103== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2103== Command: ./te1
==2103== 
--2103-- warning: L3 cache found, using its data for the LL simulation.
The call took 1.647099s to run.
Beijing
==2103== 
==2103== I   refs:      255,763,381
==2103== I1  misses:          3,709
==2103== LLi misses:          2,743
==2103== I1  miss rate:        0.00%
==2103== LLi miss rate:        0.00%
==2103== 
==2103== D   refs:      109,437,132  (77,838,331 rd   + 31,598,801 wr)
==2103== D1  misses:        352,474  (   254,714 rd   +     97,760 wr)
==2103== LLd misses:        149,260  (    96,250 rd   +     53,010 wr)
==2103== D1  miss rate:         0.3% (       0.3%     +        0.3%  )
==2103== LLd miss rate:         0.1% (       0.1%     +        0.1%  )
==2103== 
==2103== LL refs:           356,183  (   258,423 rd   +     97,760 wr)
==2103== LL misses:         152,003  (    98,993 rd   +     53,010 wr)
==2103== LL miss rate:          0.0% (       0.0%     +        0.1%  )

For the C++ version (no optimization flag)

$ valgrind --tool=cachegrind ./a.out
==2180== Cachegrind, a cache and branch-prediction profiler
==2180== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2180== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2180== Command: ./a.out
==2180== 
--2180-- warning: L3 cache found, using its data for the LL simulation.
took 64.657681 sec
==2180== 
==2180== I   refs:      5,281,474,482
==2180== I1  misses:            1,710
==2180== LLi misses:            1,651
==2180== I1  miss rate:          0.00%
==2180== LLi miss rate:          0.00%
==2180== 
==2180== D   refs:      3,170,495,683  (1,840,363,429 rd   + 1,330,132,254 wr)
==2180== D1  misses:           12,055  (       10,374 rd   +         1,681 wr)
==2180== LLd misses:            7,383  (        6,132 rd   +         1,251 wr)
==2180== D1  miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== LLd miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== 
==2180== LL refs:              13,765  (       12,084 rd   +         1,681 wr)
==2180== LL misses:             9,034  (        7,783 rd   +         1,251 wr)
==2180== LL miss rate:            0.0% (          0.0%     +           0.0%  )

For the C++ version (with optimization)

$ valgrind --tool=cachegrind ./a.out
==2157== Cachegrind, a cache and branch-prediction profiler
==2157== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2157== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2157== Command: ./a.out
==2157== 
--2157-- warning: L3 cache found, using its data for the LL simulation.
took 9.419447 sec
==2157== 
==2157== I   refs:      1,451,459,660
==2157== I1  misses:            1,599
==2157== LLi misses:            1,549
==2157== I1  miss rate:          0.00%
==2157== LLi miss rate:          0.00%
==2157== 
==2157== D   refs:        430,486,197  (340,358,108 rd   + 90,128,089 wr)
==2157== D1  misses:           12,008  (     10,337 rd   +      1,671 wr)
==2157== LLd misses:            7,372  (      6,120 rd   +      1,252 wr)
==2157== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== 
==2157== LL refs:              13,607  (     11,936 rd   +      1,671 wr)
==2157== LL misses:             8,921  (      7,669 rd   +      1,252 wr)
==2157== LL miss rate:            0.0% (        0.0%     +        0.0%  )
packetie
  • Is it possible that the C++ implementation is constructing a new `std::string` every time it is doing a lookup? – dreamlax Nov 27 '15 at 05:10
  • 1
    Yes cache the key in a local string variable outside the for loop. – Neil Kirk Nov 27 '15 at 05:11
  • Can you inspect the assembly output for the C++ and the Go versions? It's entirely possible that a compiler is being clever, and doing something like eliminating the loop because it has no side effects. – James Picone Nov 27 '15 at 05:11
  • Also grab the string in x with a reference or pointer to avoid copying. – Neil Kirk Nov 27 '15 at 05:12
  • 3
    Did you turn optimization on? – Galik Nov 27 '15 at 05:14
  • @Galik: I would say not. When I compile the code without `-O2` it takes almost the same amount of time as OP, but with `-O2` it takes no time at all. – dreamlax Nov 27 '15 at 05:15
  • Thanks for the suggestions, I updated the C++ version (see "Update 1") but there is not much difference. – packetie Nov 27 '15 at 05:17
  • The conversion of the string literal to a string is a greater cost than the return value. – Neil Kirk Nov 27 '15 at 05:18
  • @dreamlax optimization probably removes the hash access (the result of which is unused) entirely. Probably need to assign it to a `volatile` variable or some such. – hobbs Nov 27 '15 at 05:19
  • outside the for loop, add string key = "China"; in for loop use mymap[key] – cezheng Nov 27 '15 at 05:26
  • @cezheng and other. Tried it. See update 3. Thx. – packetie Nov 27 '15 at 05:32
  • @codingFun what about mymap.at(key); – cezheng Nov 27 '15 at 05:33
  • What is the time with update 3 and optimization? – Neil Kirk Nov 27 '15 at 05:33
  • @cezheng, no difference between `mymap.at(key);` and `mymap[key];` – packetie Nov 27 '15 at 05:47
  • @neil-kirk In update 3, with optimization, it's about 80ms. So optimization really made a huge difference here. – packetie Nov 27 '15 at 05:49
  • Do you also have a debugger attached? – MooseBoys Nov 27 '15 at 05:52
  • @MooseBoys no sir. I didn't. Your question got me curious and I ran it from gdb (on the update 3 version, non-optimized), it's only marginally slower than running directly: about 230ms. – packetie Nov 27 '15 at 06:02
  • 2
    I would actually not care about these benchmarks because at least at the moment the Perl code is not doing what it should do, even after it got "fixes". I don't know about the other code but it might be wrong too or the compiler would have optimized stuff away. Definitely far from any reliable benchmarks. – Steffen Ullrich Nov 27 '15 at 06:42
  • 3
    Apart from that: benchmarks which run for less than a second are useless, because it is pure luck which power mode the processor is currently in and what processes happen to run in parallel and eat CPU time. Real benchmarks run for hours to make sure the results are not affected too much by such problems. – Steffen Ullrich Nov 27 '15 at 06:48
  • Why didn't you improve the Perl version from `$x = $mymap{"China"};` to `$mymap{"China"}` (although it will give you a `warning`), and change `for ($i=0; $i<1000000; $i++)` to the more Perlish `for my $i (0 .. 1000000)`? – Arunesh Singh Nov 27 '15 at 07:40
  • Perl's version might not be the fastest since it has to deal with two string storage formats. It also has heavy data types. There's also time spent storing the keys into a second hash so they can be used by multiple hashes. Perl's version is very robust to attacks, which may or may not have an effect on performance. Also, `for ($i=0; $i<1000000; $i++)` is an inefficient way of writing `for my $i (0..999999)`. – ikegami Nov 27 '15 at 21:22
  • PS, `mymap["China"];` should be `x = mymap["China"];` in the C++ version to be equivalent to the other two. – ikegami Nov 27 '15 at 21:22
  • PS, You misspelled Madrid :) – ikegami Nov 27 '15 at 21:24
  • I find it very interesting that you got the same numbers for `$mymap["China"]` (effectively `$mymap[0]`) as for `$mymap{"China"}`, or did you forget to update the time after changing the code? – ikegami Nov 27 '15 at 21:26

4 Answers

16

From your Perl code (before your attempts to fix it):

@mymap = ();
$mymap["U.S."] = "Washington";
$mymap["U.K."] = "London";

This is not a map but an array. The syntax for hash maps is:

  %mymap;
  $mymap{"U.S."} = ....

Thus what you effectively do is create an array, not a hash map, and access element 0 all the time. Please use `use strict;` and `use warnings;` all the time with Perl; even a simple syntax check with warnings would have shown you the problem:

perl -cw x.pl
Argument "U.S." isn't numeric in array element at x.pl line 9.
Argument "U.K." isn't numeric in array element at x.pl line 10.

Apart from that, the main part of the benchmark effectively does nothing useful (it assigns a variable and never uses it), and some compilers can detect this and simply optimize it away.

If you check the code generated from your Perl program, you will see:

$ perl -MO=Deparse x.pl
@mymap = ();
$mymap[0] = 'Washington';
$mymap[0] = 'London';
...
for ($i = 0; $i < 1000000; ++$i) {
    $x = $mymap[0];
}

That is, Perl detects the problem at compile time and folds every lookup into an access to array index 0.

Thus whenever you do benchmarks you need to:

  • Check that your program actually does what it is supposed to.
  • Check that the compiler does not optimize things away and does not execute stuff at compile time where other languages will do it at run time. Any kind of statements with no or predictable results are prone to such optimization.
  • Verify that you actually measure what you intend to measure and not something else. Even small changes to the program can affect the run time, because a memory allocation may be needed where none was before, etc., and these changes might not be related to what you intend to measure. In your benchmark you measure access to the same hash element again and again, without any access to other elements in between. This is an activity which can play very nicely with processor caches but probably does not reflect real-world use.

And using a simple timer is not a realistic benchmark. There are other processes on the system, there is the scheduler, there is cache thrashing... and with today's CPUs the result depends heavily on the load on the system, because the CPU may run one benchmark in a lower power mode than the others, i.e. at a different CPU clock. For example, multiple runs of the same "benchmark" vary in measured time between 100ms and 150ms on my system.

Benchmarks are lies and micro benchmarks like yours doubly so.

Steffen Ullrich
  • Thanks @steffen-ullrich for pointing out the syntax error. It's now fixed, but the performance number didn't go up. – packetie Nov 27 '15 at 06:33
  • 1
    @codingFun: Don't expect the number to go up because it should do actually more complex operations after the code got fixed. Also like I said: benchmarks are lies and micro benchmarks like yours doubly so. Read the parts about CPU, power mode, scheduler.... And verify what the other code does, based on the generated code and not your input. Like I said a smart compiler might realize that it does nothing useful and optimize it away. – Steffen Ullrich Nov 27 '15 at 06:34
  • @codingFun: Apart from that your code is still wrong and if you would actually use warnings like I've recommended you would see it. – Steffen Ullrich Nov 27 '15 at 06:40
  • You are right, a micro benchmark can only give a very rough idea. More study would be needed to choose which language to use for a project. By the way, I had the typo in `$x = $mymap["China"];` while I was editing it. It's updated now. Performance is about 180ms. – packetie Nov 27 '15 at 06:46
  • 2
    @codingFun: as you can see the performance varies a lot when making small changes. Just a simple change from `$x = $mymap{..}` to `my $x = $mymap{...}` will affect the performance significantly. And on my system there are differences of 50% between multiple runs of the program. – Steffen Ullrich Nov 27 '15 at 06:50
  • You are right again; I tried `my $x = $mymap{...}` and the speed does increase (time is reduced) by about 25%. Since 200ms is a short time and there are big differences from run to run (as you pointed out), I stabilized it by running a loop of 30M (instead of 1M) and normalizing the time to 1M iterations. – packetie Nov 27 '15 at 06:57
4

I've modified your example a little bit to get some details about the structure of the hash table:

#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>
#include <chrono>

using namespace std;
int main()
{
    std::unordered_map<std::string, std::string> mymap;

    // populating container:
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrad";

    std::hash<std::string> h;
    for ( auto const& i : mymap )
    {
        printf( "hash(%s) = %zu\n", i.first.c_str(), h( i.first ) );
    }

    for ( size_t i = 0; i != mymap.bucket_count(); ++i )
    {
        auto const bucketsize = mymap.bucket_size( i );
        if ( 0 != bucketsize )
        {
            printf( "size of bucket %zu = %zu\n", i, bucketsize );
        }
    }

    auto const start = std::chrono::steady_clock::now();

    string const x = "China";
    std::string res;

    for ( int i = 0; i < 1000000; i++ )
    {
        mymap.find( x );
    }

    auto const elapsed = std::chrono::steady_clock::now() - start;
    printf( "%s\n", res.c_str() );
    printf( "took %lld ms\n",
            (long long) std::chrono::duration_cast<std::chrono::milliseconds>( elapsed ).count() );
    return 0;
}

Running this on my system, I get a runtime of ~68ms with the following output:

hash(Japan) = 3611029618
hash(Spain) = 749986602
hash(China) = 3154384700
hash(U.S.) = 2546447179
hash(Italy) = 2246786301
hash(Germany) = 2319993784
hash(U.K.) = 2699630607
hash(France) = 3266727934
hash(Russia) = 3992029278
size of bucket 0 = 0
size of bucket 1 = 0
size of bucket 2 = 1
size of bucket 3 = 1
size of bucket 4 = 1
size of bucket 5 = 0
size of bucket 6 = 1
size of bucket 7 = 0
size of bucket 8 = 0
size of bucket 9 = 2
size of bucket 10 = 3

It turns out that the hash table is not well distributed for these keys and contains some collisions. Further printing the elements per bucket shows that Spain and China are both in bucket 9. The bucket is probably a linked list with its nodes scattered in memory, which explains the performance drop.

If you choose another hash table size such that there are no collisions, you can get a speedup. I tested it by adding mymap.rehash(1001) and got a speedup of 20-30%, to something between 44-52ms.

Now, another point is the computation of the hash value for "China". The hash function is executed in each iteration. We can make this go away by switching to constant plain C strings:

#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>
#include <chrono>

static auto constexpr us = "U.S.";
static auto constexpr uk = "U.K.";
static auto constexpr fr = "France";
static auto constexpr ru = "Russia";
static auto constexpr cn = "China";
static auto constexpr ge = "Germany";
static auto constexpr jp = "Japan";
static auto constexpr it = "Italy";
static auto constexpr sp = "Spain";

using namespace std;
int main()
{
    std::unordered_map<const char*, std::string> mymap;

    // populating container:
    mymap[us] = "Washington";
    mymap[uk] = "London";
    mymap[fr] = "Paris";
    mymap[ru] = "Moscow";
    mymap[cn] = "Beijing";
    mymap[ge] = "Berlin";
    mymap[jp] = "Tokyo";
    mymap[it] = "Rome";
    mymap[sp] = "Madrad";

    string const x = "China";
    char const* res = nullptr;
    auto const start = std::chrono::steady_clock::now();
    for ( int i = 0; i < 1000000; i++ )
    {
        res = mymap[cn].c_str();
    }

    auto const elapsed = std::chrono::steady_clock::now() - start;
    printf( "%s\n", res );
    printf( "took %lld ms\n",
            (long long) std::chrono::duration_cast<std::chrono::milliseconds>( elapsed ).count() );
    return 0;
}

On my machine, this reduces the runtime by 50% to ~20ms. The difference is that instead of computing the hash value from the string contents, the map now just transforms the address into a hash value, which is much faster because it essentially returns the pointer value as a size_t. We also do not need to rehash because there are no collisions for the bucket containing cn.

Jens
  • Switching to `char*` means that you use reference equality, so the code won't work if you pass it strings which come from a different source. – CodesInChaos Nov 27 '15 at 12:46
  • @CodesInChaos You are right, but I don't think this is the point of that micro-benchmark. – Jens Nov 27 '15 at 12:50
  • Thanks @Jens for the detailed explanation. The collision is indeed a big problem. Using a different key, "France", improves the performance by about 25%. Comparing pointers does improve the performance tremendously but it's not what we need here. I believe the performance has a lot to do with the number of instructions executed, as james-picone pointed out on the assembly generated. Created a update 5 on the question. Thanks. – packetie Nov 27 '15 at 15:54
  • @codingFun The example using the pointers shows that a good part of the runtime is spent computing hash values, because this is what changes with the type. I guess that the Go version either has a much better hash function for strings, maybe caching the hash value and thus computing it only once, or it uses some other strategy to implement a map for small strings. – Jens Nov 27 '15 at 16:09
  • @jens, same here. Just wish C++ implementation could catch up with that of go. – packetie Nov 27 '15 at 16:36
3

This just shows that, for this particular use case, Go's hash map implementation is optimized very well.

mymap["China"] calls mapaccess1_faststr, which is specifically optimized for string keys. In particular, for small one-bucket maps the hash code is not even calculated for short (less than 32 bytes) strings.

kostya
2

This is a guess:

unordered_map::operator[] requires a std::string argument, and you are providing a char*. Without optimisations on, the C++ version is likely calling the std::string(const char*) constructor a million times in order to turn "China" into a std::string. Go's language spec makes string literals the same type as string, so no constructors need to be called.

With optimisations on, the string construction will get hoisted out of the loop, and you won't see the same problem. Or quite possibly no code will be generated at all except the two system calls to get the time and the call to print the difference, because ultimately the loop has no effect.

To confirm that, you'd have to actually look at what assembly is being generated. That'll be a compiler option. See this question for the flags required for GCC.

James Picone
James Picone