
I've done a few years of c# now, and I'm trying to learn some new stuff. So I decided to have a look at c++, to get to know programming in a different way.

I've been doing loads of reading, but I just started writing some code today.

On my Windows 7/64-bit machine, running VS2010, I created two projects: 1) A c# project that lets me write things the way I'm used to. 2) A c++ "makefile" project that lets me play around, trying to implement the same thing. From what I understand, this ISN'T a .NET project.

I got to trying to populate a dictionary with 10K values. For some reason, the c++ is orders of magnitude slower.

Here's the c# below. Note I put in a function after the time measurement to ensure it wasn't "optimized" away by the compiler:

var freq = System.Diagnostics.Stopwatch.Frequency;

int i;
Dictionary<int, int> dict = new Dictionary<int, int>();
var clock = System.Diagnostics.Stopwatch.StartNew();

for (i = 0; i < 10000; i++)
     dict[i] = i;
clock.Stop();

Console.WriteLine(clock.ElapsedTicks / (decimal)freq * 1000M);
Console.WriteLine(dict.Average(x=>x.Value));
Console.ReadKey(); //Don't want results to vanish off screen

Here's the c++, not much thought has gone into it (trying to learn, right?):

LARGE_INTEGER frequency;        // ticks per second
LARGE_INTEGER t1, t2;           // ticks
double elapsedTime;

// get ticks per second
QueryPerformanceFrequency(&frequency);

int i;
boost::unordered_map<int, int> dict;
// start timer
QueryPerformanceCounter(&t1);

for (i=0;i<10000;i++)
    dict[i]=i;

// stop timer
QueryPerformanceCounter(&t2);

// compute and print the elapsed time in millisec
elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
cout << elapsedTime << " ms insert time\n";
int input;
cin >> input; //don't want console to disappear

Now, some caveats. I managed to find this related SO question. One of the guys wrote a long answer mentioning WOW64 skewing the results. I've set the project to release and gone through the "properties" tab of the c++ project, enabling everything that sounded like it would make it fast. Changed the platform to x64, though I'm not sure whether that addresses his wow64 issue. I'm not that experienced with the compiler options, perhaps you guys have more of a clue?

Oh, and the results: c#: 0.32ms, c++: 8.26ms. This is a bit strange. Have I misinterpreted something about what .QuadPart means? I copied the c++ timer code from someplace on the web, going through all the boost installation and include/libfile rigmarole. Or perhaps I am actually using different instruments unwittingly? Or is there some critical compile option that I haven't used? Or maybe the c# code is optimized because the average is a constant?

Here's the c++ command line, from the Property page->C/C++->Command Line: /I"C:\Users\Carlos\Desktop\boost_1_47_0" /Zi /nologo /W3 /WX- /MP /Ox /Oi /Ot /GL /D "_MBCS" /Gm- /EHsc /GS- /Gy- /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope /Fp"x64\Release\MakeTest.pch" /Fa"x64\Release\" /Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue

Any help would be appreciated, thanks.

Carlos
  • Have you tried std::map instead of boost::unordered_map? – Violet Giraffe Aug 07 '11 at 19:31
  • Don't trust that other answer too much. His comment about WOW64 in particular is totally off-base, there might be a penalty for system calls (although I don't think that's even significant) but definitely not for math. x86 FPU code runs just as fast with WOW64 as with a 32-bit processor. About half the other things in that answer are off-base as well. – Ben Voigt Aug 07 '11 at 22:30
  • Yeah, I tried map, then I read that it's more similar to SortedDictionary. Had a play around with the types, no difference. – Carlos Aug 08 '11 at 07:14
  • I cannot reproduce that result at all. My simple C++0x implementation is consistently faster by a good amount than your C# version. I'm using GCC 4.6.1 with `-O3 -march=native` and `gmcs`. – Kerrek SB Aug 11 '11 at 13:53

4 Answers


A simple allocator change will cut that time down a lot.

boost::unordered_map<int, int, boost::hash<int>, std::equal_to<int>, boost::fast_pool_allocator<std::pair<const int, int>>> dict;

0.9ms on my system (from 10ms before). This suggests to me that the vast, vast majority of your time is not spent in the hash table at all, but in the allocator. The reason this is an unfair comparison is that your GC will never collect in such a trivial program, giving it an undue performance advantage, while native allocators do significant caching of free memory; that caching never comes into play in such a trivial example, because you've never allocated or deallocated anything, so there's nothing to cache.

Finally, the Boost pool implementation is thread-safe, whereas you never play with threads so the GC can just fall back to a single-threaded implementation, which will be much faster.

I resorted to a hand-rolled, non-freeing, non-thread-safe pool allocator and got C++ down to 0.525ms versus 0.45ms for C# (on my machine). Conclusion: your original results were heavily skewed by the different memory allocation schemes of the two languages, and once that was resolved, the difference became relatively minimal.

A custom hasher (as described in Alexandre's answer) dropped my C++ time to 0.34ms, which is now faster than C#.

static const int MaxMemorySize = 800000;
static int FreedMemory = 0;
static int AllocatorCalls = 0;
static int DeallocatorCalls = 0;
template <typename T>
class LocalAllocator
{
  public:
      std::vector<char>* memory;
      int* CurrentUsed;
      typedef T value_type;
      typedef value_type * pointer;
      typedef const value_type * const_pointer;
      typedef value_type & reference;
      typedef const value_type & const_reference;
      typedef std::size_t size_type;
      typedef std::ptrdiff_t difference_type;

    template <typename U> struct rebind { typedef LocalAllocator<U> other; };

    template <typename U>
    LocalAllocator(const LocalAllocator<U>& other) {
        CurrentUsed = other.CurrentUsed;
        memory = other.memory;
    }
    LocalAllocator(std::vector<char>* ptr, int* used) {
        CurrentUsed = used;
        memory = ptr;
    }
    template<typename U> LocalAllocator(LocalAllocator<U>&& other) {
        CurrentUsed = other.CurrentUsed;
        memory = other.memory;
    }
    pointer address(reference r) { return &r; }
    const_pointer address(const_reference s) { return &s; }
    size_type max_size() const { return MaxMemorySize; }
    void construct(pointer ptr, value_type&& t) { new (ptr) T(std::move(t)); }
    void construct(pointer ptr, const value_type & t) { new (ptr) T(t); }
    void destroy(pointer ptr) { static_cast<T*>(ptr)->~T(); }

    bool operator==(const LocalAllocator& other) const { return memory == other.memory; }
    bool operator!=(const LocalAllocator& other) const { return !(*this == other); }

    pointer allocate(size_type count) {
        AllocatorCalls++;
        if (*CurrentUsed + (count * sizeof(T)) > MaxMemorySize)
            throw std::bad_alloc();
        if (*CurrentUsed % std::alignment_of<T>::value) {
            *CurrentUsed += (std::alignment_of<T>::value - *CurrentUsed % std::alignment_of<T>::value);
        }
        auto val = &((*memory)[*CurrentUsed]);
        *CurrentUsed += (count * sizeof(T));
        return reinterpret_cast<pointer>(val);
    }
    void deallocate(pointer ptr, size_type n) {
        DeallocatorCalls++;
        FreedMemory += (n * sizeof(T));
    }

    pointer allocate() {
        return allocate(1);
    }
    void deallocate(pointer ptr) {
        return deallocate(ptr, 1);
    }
};
int main() {
    LARGE_INTEGER frequency;        // ticks per second
    LARGE_INTEGER t1, t2;           // ticks
    double elapsedTime;

    // get ticks per second
    QueryPerformanceFrequency(&frequency);
    std::vector<char> memory;
    int CurrentUsed = 0;
    memory.resize(MaxMemorySize);

    struct custom_hash {
        size_t operator()(int x) const { return x; }
    };
    boost::unordered_map<int, int, custom_hash, std::equal_to<int>, LocalAllocator<std::pair<const int, int>>> dict(
        std::unordered_map<int, int>().bucket_count(),
        custom_hash(),
        std::equal_to<int>(),
        LocalAllocator<std::pair<const int, int>>(&memory, &CurrentUsed)
    );

    // start timer
    std::string str;
    QueryPerformanceCounter(&t1);

    for (int i=0;i<10000;i++)
        dict[i]=i;

    // stop timer
    QueryPerformanceCounter(&t2);

    // compute and print the elapsed time in millisec
    elapsedTime = ((t2.QuadPart - t1.QuadPart) * 1000.0) / frequency.QuadPart;
    std::cout << elapsedTime << " ms insert time\n";
    int input;
    std::cin >> input; //don't want console to disappear
}
Puppy
  • Indeed. It is likely that each insertion will create a new node in a bucket. Allocation of small objects in modern garbage collected languages is usually just a pointer increment, whereas the default allocator in C++ must find its way into a complex data structure. – Alexandre C. Aug 11 '11 at 12:28
  • Just got back to looking at this. Works now, fantastic result. (Only 1 issue: you've capitalised 'Memory' in some places and used the lower case elsewhere.) – Carlos Jan 28 '12 at 23:35

Storing a consecutive sequence of numeric integral keys added in ascending order is definitely NOT what hash tables are optimized for.

Use an array, or else generate random values.

And do some retrievals. Hash tables are highly optimized for retrieval.

Ben Voigt
  • @Duck: You don't usually try to understand in great detail why using the wrong data structure has bad performance, you switch to the right data structure. – Ben Voigt Aug 07 '11 at 22:31
  • I disagree. Knowing how a data structure works internally gives you more power in knowing what to use in other circumstances. Plus, what if we don't know what the .NET `Dictionary<>` uses -- perhaps it uses a hash table as well, which still means that your answer does not fully address the performance difference described by the OP. – DuckMaestro Aug 08 '11 at 00:00
  • @Duck: We do know that ``System.Collections.Generic.Dictionary`2`` also uses a hash table. But these structures are supposed to be optimized for fast lookup. If you want fast insertion, you'd use a data structure optimized for that. If you want to compare hash tables, you should do the operation they're intended for (lookup). – Ben Voigt Aug 08 '11 at 00:09
  • FYI I haven't done the lookup part of the exercise yet. I figured something is already wrong if I can't get them to be a similar order of magnitude on the insert. – Carlos Aug 08 '11 at 07:16
  • @Carlos: Why? That's like saying there's no point in looking at the performance of Java or C# on real-world tasks, because getting to the first line of user code is 1000 times slower than C++. – Ben Voigt Aug 08 '11 at 21:00
  • Well, I was guessing the problem was my compile options. I'll have a look at the lookup and see what happens. – Carlos Aug 09 '11 at 12:07

You can try dict.rehash(n) with different (large) values of n before inserting elements, and see how this impacts performance. Memory allocations (which take place when the container fills buckets) are generally more expensive in C++ than in C#, and rehashing is also heavy. For std::vector and std::deque, the analogous member function is reserve.

Different rehash policies and load factor threshold (have a look at the max_load_factor member function) will also greatly impact unordered_map's performance.

Next, since you're using VS2010, I suggest you use std::unordered_map from the <unordered_map> header. Don't use boost when you can use the standard library.

The actual hash function used may greatly impact performance. You may try with the following:

struct custom_hash { size_t operator()(int x) const { return x; } };

and use std::unordered_map<int, int, custom_hash>.

Finally, I agree that this is a poor usage of hash tables. Use random values for insertion, you'll get a more precise picture of what is going on. Testing insertion speeds of hash tables isn't stupid at all, but hash tables are not meant to store consecutive integers. Use a vector for this.

Alexandre C.

Visual Studio TR1 unordered_map is the same as stdext::hash_map:

Another thread asked why it performs slowly; see my answer there, with links to others who have discovered the same issue. The conclusion is to use another hash_map implementation when in C++:

Alternative to stdext::hash_map for performance reasons

Btw., remember that in C++ there is a big difference between an optimized Release build and a non-optimized Debug build, much more so than in C#.

Rolf Kristensen