129

It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map, which is a vector-based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map, so I haven't been able to find any performance comparisons. How does it compare, and what are the best use cases for it?

Thanks!

nobody
naumcho
  • Important to note that https://www.boost.org/doc/libs/1_70_0/doc/html/boost/container/flat_map.html claims random insertion takes logarithmic time, implying populating a boost::flat_map (by inserting n random elements) takes O(n log n) time. It's lying, as is evident from the graphs in @v.oddou's answer below: random insert is O(n), and n of them takes O(n^2) time. – Don Hatch Jun 14 '19 at 00:20
  • @DonHatch How about reporting this here: https://github.com/boostorg/container/issues ? (it may be giving a count of the number of comparisons, but that is indeed misleading if not accompanied by a count of the number of moves) – Marc Glisse Jun 14 '19 at 06:55
  • @DonHatch I don't see where it says that `random insertion takes logarithmic time`. It says `Logarithmic search time plus linear insertion ...`, which is the same as your claim; maybe the doc has changed. BTW, I don't see that flat_map is much more outstanding than std::vector + std::sort + std::binary_search. – HarryLeong Feb 25 '21 at 18:38
  • @HarryLeong Looking again. If I follow the same link I gave above: https://www.boost.org/doc/libs/1_70_0/doc/html/boost/container/flat_map.html I see 14 different member functions with "insert" in their names, plus six with "emplace" in their names, plus two operator[]'s. Some of these 22 member functions claim "logarithmic search time plus linear insertion" as you said; most of them just claim "logarithmic" which is wrong. And it looks to me like some but not all of the wrong ones have been fixed in more recent doc ("Click here to view this page for the latest version", to 1_75_0). – Don Hatch Feb 26 '21 at 00:32
  • @HarryLeong You say "BTW, I don't see the flat_map is much more outstanding than std::vector + std::sort + std::binary_search." Agreed. I don't see that flat_map provides any value at all, let alone enough to be worth the effort of trying to fix or maintain its extensive documentation. – Don Hatch Feb 26 '21 at 00:49

2 Answers

228

I ran a benchmark on different data structures very recently at my company, so I feel I need to drop a word here. It is very complicated to benchmark something correctly.

Benchmarking

On the web, we rarely find (if ever) a well-engineered benchmark. Until today I had only found benchmarks that were done the journalist way (pretty quickly, sweeping dozens of variables under the carpet).

1) You need to consider cache warming

Most people running benchmarks are afraid of timer discrepancy, so they run their code thousands of times and take the total time; they are just careful to use the same number of iterations for every operation, and then consider the results comparable.

The truth is that in the real world this makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time things by calling them once only. Intel has published a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).

2) RDTSC accuracy measure

I also recommend doing this:

#include <algorithm>   // std::min, std::max
#include <cstdio>      // printf

typedef unsigned long long u64;

u64 g_correctionFactor;  // number of clocks to subtract after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#include <intrin.h>      // __cpuid, __rdtsc
#pragma intrinsic(__rdtsc)

inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // serializing instruction: flushes the OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // warm up cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer overhead with the measurer (crazy, heh..)
    u64 minDiff = errormeasure;  // start at the largest possible u64 value
    u64 maxDiff = 0;             // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1);   // take many measurements, keep the smallest ever seen.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;

    printf("Correction factor %llu clocks\n", g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif

This is a discrepancy measurer; it takes the minimum of all measured values, to avoid occasionally getting a bogus reading around -10^18 (the first negative values, when the 64-bit unsigned difference wraps around).

Notice the use of intrinsics and not inline assembly. First, inline assembly is rarely supported by compilers nowadays; but much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze the inside, so this is a problem for benchmarking real-world code, especially when calling things just once. So an intrinsic is suited here, because it doesn't break the compiler's free reordering of instructions.
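
For completeness, here is a minimal sketch (my illustration, not part of the original benchmark code) of how the MSVC helpers above could be used to time a single cold call, with the measurer's own overhead subtracted; DoOperationUnderTest is just a placeholder name for whatever you want to measure:

u64 MeasureOnce()
{
    u64 tick1 = GetRDTSC();
    DoOperationUnderTest();   // placeholder for the single operation under test
    u64 tick2 = GetRDTSC();
    u64 raw = tick2 - tick1;
    // subtract the measurer's own overhead, clamping at zero
    return raw > g_correctionFactor ? raw - g_correctionFactor : 0;
}

int main()
{
    WarmupRDTSC();   // calibrates g_correctionFactor and g_accuracy first
    printf("operation took %llu clocks (+/- %llu)\n", MeasureOnce(), g_accuracy);
    return 0;
}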

3) Parameters

The last problem is that people usually test too few variations of the scenario. A container's performance is affected by:

  1. Allocator
  2. size of the contained type
  3. cost of implementation of the copy operation, assignment operation, move operation, construction operation, of the contained type.
  4. number of elements in the container (size of the problem)
  5. whether the type has trivial versions of the operations in point 3
  6. whether the type is a POD

Point 1 is important because containers do allocate from time to time, and it matters a lot whether they allocate using the CRT "new" or some user-defined operation, like pool allocation, a freelist, or others...

(For people interested in point 1, join the mystery thread on gamedev about the system allocator's performance impact.)
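
As a modern (C++17) illustration of point 1, and definitely not what was used in this benchmark: the very same node-based map can be pointed at a preallocated buffer via std::pmr, which often changes its performance profile noticeably:

#include <map>
#include <memory_resource>

int main()
{
    // Sketch: identical std::map interface, but node allocations are carved
    // out of a local buffer instead of going through the global allocator.
    char buffer[1 << 16];
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));
    std::pmr::map<int, double> m(&pool);
    for (int i = 0; i < 100; ++i)
        m[i] = i * 0.5;
    return 0;
}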

Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.

Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.

Point 4 is a question of big O mixed with cache issues. Some bad-complexity containers can largely outperform low-complexity containers for a small number of elements (like vector vs. map: the vector's cache locality is good, while the map fragments memory). And then at some crossing point they will lose, because the overall contained size starts to "leak" into main memory and cause cache misses, plus the fact that the asymptotic complexity starts to be felt.

Point 5 is about compilers being able to elide operations that are empty or trivial at compile time. This can greatly optimize some operations, because the containers are templated, so each type will have its own performance profile.

Point 6 is the same as point 5: PODs can benefit from the fact that copy construction is just a memcpy, and some containers have a specific implementation for these cases, using partial template specializations or SFINAE to select algorithms according to traits of T.
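
A rough sketch of point 6 (illustrative only, not boost's actual code): a trait selects a memcpy fast path for trivially copyable types. It is shown with C++17 if constexpr for brevity, rather than the SFINAE or partial-specialization techniques mentioned above:

#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

// Copy-construct n elements of T into raw destination storage: bulk memcpy
// when T is trivially copyable, element-wise placement-new otherwise.
template <typename T>
void copy_construct_range(const T* src, T* dst, std::size_t n)
{
    if constexpr (std::is_trivially_copyable_v<T>) {
        std::memcpy(dst, src, n * sizeof(T));
    } else {
        for (std::size_t i = 0; i < n; ++i)
            new (dst + i) T(src[i]);
    }
}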

About the flat map

Apparently, the flat map is a sorted vector wrapper, like Loki's AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate the insertion and deletion of single elements.
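
To make the idea concrete, here is a minimal sketch of a sorted-vector map (my own simplification, not boost's or Loki's actual implementation): lookup is a binary search over contiguous storage, insertion is a binary search plus shifting the tail of the vector:

#include <algorithm>
#include <utility>
#include <vector>

template <typename K, typename V>
struct FlatMapSketch
{
    std::vector<std::pair<K, V>> data;   // kept sorted by key

    V* find(const K& key)
    {
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const std::pair<K, V>& e, const K& k) { return e.first < k; });
        return (it != data.end() && it->first == key) ? &it->second : nullptr;
    }

    void insert(K key, V value)
    {
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const std::pair<K, V>& e, const K& k) { return e.first < k; });
        if (it == data.end() || it->first != key)
            data.insert(it, std::make_pair(std::move(key), std::move(value)));  // shifts the tail: O(N)
    }
};

Lookups touch one contiguous array (cache-friendly), while each insert shifts everything after the insertion point, which is exactly the O(N)-per-insert behavior discussed in the comments below.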

This is still an ordered container. Most people usually don't need the ordering part; hence the existence of unordered...

Have you considered that maybe you need a flat_unordered_map? Which would be something like google::sparse_map or something like that: an open-addressing hash map.

The problem with open-addressing hash maps is that at rehash time they have to copy everything over to the new, extended flat storage, whereas a standard unordered_map just has to recreate the hash index while the allocated data stays where it is. The disadvantage, of course, is that the memory is fragmented like hell.

The rehash criterion in an open-addressing hash map is reached when the number of stored elements exceeds the size of the bucket vector multiplied by the load factor.

A typical load factor is 0.8; therefore you need to care about it. If you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon; this guarantees that you never have to spuriously rehash and recopy everything during filling.
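
In code, that pre-sizing rule looks like this (a sketch; the container name and reserve semantics are assumptions, so adapt them to whichever open-addressing map you actually use):

#include <cstddef>

// Capacity to request so that inserting `intended_filling` elements never
// crosses the load-factor threshold and forces a rehash.
std::size_t presize_for(std::size_t intended_filling, double load_factor = 0.8)
{
    return static_cast<std::size_t>(intended_filling / load_factor) + 1;  // +1 plays the role of epsilon
}

// Usage with some hypothetical open-addressing map type:
//   FlatHashMap<int, Payload> m;
//   m.reserve(presize_for(100000));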

The advantage of closed-addressing maps (std::unordered_...) is that you don't have to care about those parameters.

But boost::flat_map is an ordered vector; therefore, its lookup will always have O(log N) asymptotic complexity, which is worse than that of the open-addressing hash map (amortized constant time). You should consider that as well.

Benchmark results

This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.

tested types information:

typeid=__int64 .  sizeof=8 . ispod=yes
typeid=struct MediumTypePod .  sizeof=184 . ispod=yes

Insertion

EDIT:

My previous results included a bug: they actually tested ordered insertion, which exhibited very fast behavior for the flat maps.
I left those results further down this page because they are interesting.
This is the correct test:

[chart: random insert, 100 elements]

[chart: random insert, 10000 elements]

I have checked the implementation: there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, so this benchmark exhibits the asymptotic tendencies:

map: O(N * log(N))
hashmaps: O(N)
vector and flatmaps: O(N * N)

Warning: hereafter, the two tests for std::map and both flat_maps are buggy and actually test ordered insertion (versus random insertion for the other containers; yes, it's confusing, sorry):

[chart: mixed insert of 100 elements without reservation]

We can see that ordered insertion results in back pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is nowhere near the absolute optimum for a back insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector, which gives us 3 million cycles; we observe 4.8 million here for ordered insertion into the flat_map (therefore 160% of the optimum).
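
For reference, the "optimal back-insertion" baseline mentioned above is simply this kind of loop (a sketch; the 184-byte struct stands in for the MediumTypePod from the table above):

#include <vector>

struct MediumTypePod { char payload[184]; };   // stand-in for the 184-byte POD tested above

int main()
{
    // Pre-reserved vector: every push_back is a plain append,
    // with no reallocation and no element shifting.
    std::vector<MediumTypePod> v;
    v.reserve(10000);
    for (int i = 0; i < 10000; ++i)
        v.push_back(MediumTypePod{});
    return 0;
}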

[chart: mixed insert of 10000 elements without reservation]

Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles come from having to shift half the data (on average) upward, one element at a time, at each insertion.

Random search of 3 elements (clocks renormalized to 1)

in size = 100

[chart: random search within a container of 100 elements]

in size = 10000

[chart: random search within a container of 10000 elements]

Iteration

over size 100 (only MediumPod type)

[chart: iteration over 100 medium pods]

over size 10000 (only MediumPod type)

[chart: iteration over 10000 medium pods]

Final grain of salt

In the end, I wanted to come back to "Benchmarking §3 Pt1" (the system allocator). In a recent experiment around the performance of an open-addressing hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).
This makes me want to warn the reader about the above results (they were made on Win7): your mileage may vary.

Anton Menshov
v.oddou
  • Does `QueryPerformanceCounter` (or associated performance tools on other platforms) take care of this kind of thing for you? – Billy ONeal Jul 30 '14 at 02:18
  • `QueryPerformanceCounter` is actually quite a crazy machinery. And you cannot be sure of what it will use internally; it can use RDTSC, in which case it will normalize the readings so that it appears to have a stable 1GHz frequency. It may use HPET, and the call might be a bit lengthy because of that. Everything is written here: http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx . I recommend `QueryPerfCntr` when you release code to clients. A benchmark is run once in a controlled environment: use RDTSC, disable Intel SpeedStep, set thread priority to time_critical – v.oddou Jul 30 '14 at 03:02
  • Also don't take it like `QPC` will shield you from any harm, because it won't. Some words from the maker of virtual dub: http://www.virtualdub.org/blog/pivot/entry.php?id=106 . and a question from a person who notices a 1.4ms delay to return from `QPC`, this would completely break the benchmark : http://stackoverflow.com/questions/1723629/what-happens-when-queryperformancecounter-is-called – v.oddou Jul 30 '14 at 03:16
  • Your vector insert results include a sort or similar, right? I'd expect vector::push_back to beat map::insert – Billy ONeal Jul 31 '14 at 05:40
  • @BillyONeal: You're absolutely right. I also realized afterwards that these insertion results are super weird. There has to be a bug in my benchmark. I leave it like that for the moment, but I'm definitely going to investigate the code right now. I have a release going on though, so my time is scarce... raah – v.oddou Jul 31 '14 at 06:08
  • I have re-run the benchmark and re-checked the code; everything seems fine. There is something I don't understand. The back-insertion with reservation in the vector takes 700,000 cycles; without pre-reserve it takes 3 million, so still faster than the 4.7 million of the boost::flatmap. However, the random insertion persists in being crazy slow. I use vector::insert with an iterator that I take from begin and call std::advance on with a random number. I would need to step into the code to see what's going on and explain that. – v.oddou Jul 31 '14 at 06:53
  • oh, in that case it makes sense. Vector's constant amortized time guarantees only apply when inserting at the end. Inserting at random positions should average O(n) per insert because everything after the insertion point must be moved forward. So we would expect quadratic behavior in your benchmark which blows up quite fast, even for small N. The AssocVector style implementations are probably deferring sort until a lookup is required, for example, rather than sorting after every insert. Hard to say without seeing your benchmark. – Billy ONeal Jul 31 '14 at 07:02
  • OMG, I never thought of deferring the sort, that's amazing. That makes a lot of sense. Then I wonder why the boost documentation says that insertion is linear... – v.oddou Jul 31 '14 at 07:22
  • @BillyONeal: Ah, we inspected the code with a colleague and found the culprit: my "random" insertion was ordered because I used a std::set to make sure the inserted keys were unique. This is plain imbecility, but I fixed it with a random_shuffle; I'm rebuilding now and some new results will appear soon as an edit. So the test in its current state proves that the "ordered insertion" is damn fast. – v.oddou Jul 31 '14 at 08:05
  • Did you turn off CPU frequency scaling and turbo boost while benchmarking though? In Windows if you set it to "performance" it locks into turbo boost till the CPU overheats. – Maxim Egorushkin Jul 31 '14 at 08:43
  • @MaximYegorushkin: I set to "high performance", I verified with Core Temp that the frequency is locked to 3.6Ghz, But the turbo boost makes it vary to 3.7 from time to time. However, I'm not completely sure that it is a problem to let the CPU change frequency, since I measure everything in clock cycles, whatever the frequency, it won't change the results :) right ? – v.oddou Jul 31 '14 at 08:55
  • Ah no, it will change the behavior versus memory latency; a lower frequency will appear to have better results as soon as we test stuff that gets out of the cache. Alright, well, I fixed my frequency to between 3.6 and 3.7 so it should be OK anyway. – v.oddou Jul 31 '14 at 08:57
  • "Intel has a paper" ← and [here](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf) it is – isomorphismes Aug 17 '15 at 17:34
  • @v.oddou, vector is missing in the random search graph. – Surt Sep 28 '15 at 11:48
  • The Intel paper you reference states, "Assumptions: In this paper, all the results shown were obtained by running tests on a **platform whose BIOS was optimized by removing every factor that could cause indeterminism**. All power optimization, Intel Hyper-Threading technology, frequency scaling and turbo mode functionalities were turned off." – ahcox Jun 26 '16 at 17:53
  • Hey, do you think you can make this benchmark suite available on github? I have a custom hash map implementation that I want to benchmark against these numbers. Thanks! – skgbanga Sep 09 '16 at 22:17
  • I would love to. But it is not my property :'( As a consolation maybe you can try putting your container in the bench bed of my open source version of the same thing that you did. Here: https://sourceforge.net/p/cgenericopenaddresshashmap/code/ci/master/tree It's not the same richness but still better than nothing. – v.oddou Sep 12 '16 at 04:03
  • Perhaps I'm missing something obvious, but I don't understand why the random search is slower on the `flat_map` compared to `std::map` - is anyone able to explain this result? – boycy Jan 11 '17 at 10:35
  • I would explain it as a specific overhead of the boost implementation of this time, and not an intrinsic character of the `flat_map` as a container. Because the `Aska::` version is faster than the `std::map` lookup. Proving that there is room for optimization. The expected performance is asymptotically the same, but maybe slightly better thanks to cache locality. With high size sets they should converge. – v.oddou Jan 12 '17 at 08:27
  • For inserting lots of elements into a `flat_set` (without any look-ups in between), one should better construct a new one [using the appropriate constructor](http://www.boost.org/doc/libs/1_63_0/doc/html/boost/container/flat_map.html#idp43147776-bb) (followed by a merge or swap) – chtz Apr 18 '17 at 21:14
  • Your abscissa labels don't make sense. How do you conclude from large number of clock cycles that `std::vector`'s back insertion is faster? Shouldn't clock cycles be proportional to time? – Ruslan Oct 13 '17 at 15:50
  • @Ruslan yes. that's because I haven't uploaded the chart of the particular test that allowed me to conclude that. But it had a 3 million clocks for the mediumSizedPod in the 10k test. – v.oddou Oct 16 '17 at 09:04
  • For asymptotic tendency of inserting N items into a hashmap, you don't need to say "amortized N"; it's just N. (Inserting 1 item is amortized 1, though.) – Don Hatch Jun 14 '19 at 00:56
  • @v.oddou Excellent job on presenting this work. I have a question. When you iterate over the std map, I assume it's done in the ordering of the keys. Is it possible to iterate over the std map elements in their stored order, instead of searching through each branch of the binary tree to present them in key order, so that the iterator is unordered but much faster? – linuxfreebird Mar 23 '20 at 12:36
  • @linuxfreebird thanks! from memory the iteration is `begin(); ++; !=end()` (same as what a range for would generate). There might be a way to speed up iteration by using a very specially designed allocator. But with a default alloc, all nodes are all over the place in memory. There is no way to iterate other than following the pointers, I can think of. You see what I mean? – v.oddou Mar 24 '20 at 11:01
  • @v.oddou Thanks for the reply. I request you help with https://stackoverflow.com/questions/60815144/is-there-a-c-standard-library-container-of-size-n-that-has-logn-insertion-an . I think the BST I am looking for is implemented as an array like structure that encodes the pointers as indices in the array. The closet thing I found that does this is https://github.com/mpaland/avl_array . I am doing numerics on large arrays that need to double in size when inserting so memory allocation calls is kept to a minimum. I could use some help. – linuxfreebird Mar 24 '20 at 13:10
  • Good point about the random inserting. I still have here the case that a flat_map is faster than a map, but they are now much closer. It seems that a flat_map is good in iteration and low on allocation cost, but when you have to move heavy objects around in the container it can become awful. Perhaps that's even a tip from Chandler's talk (i.e. put heavy objects outside associative containers). – gast128 Jul 06 '20 at 22:51
  • Have you tried insert with hint for std::map? According to https://en.cppreference.com/w/cpp/container/map/insert, insert with a good hint may improve the performance for ordered element. – 欢乐的Xiaox Dec 30 '21 at 01:57
  • I wonder why you didn't find my `StopWatch` class... https://github.com/CarloWood/cwds/blob/master/benchmark.h#L134 Granted, its `measure` member function returns something from a warmed cache (many repeats); but if you just start() and stop() you get exactly what you were interested in. – Carlo Wood Apr 09 '22 at 20:07
  • Has anybody compared flat_map with pmr::map? – macomphy Nov 01 '22 at 01:46
8

From the docs it seems this is analogous to Loki::AssocVector, which I'm a fairly heavy user of. Since it's based on a vector, it has the characteristics of a vector, that is to say:

  • Iterators get invalidated whenever the size grows beyond the capacity.
  • When it grows beyond its capacity it needs to reallocate and move objects over, i.e. insertion isn't guaranteed constant time, except for the special case of inserting at the end when capacity > size.
  • Lookup is faster than std::map due to cache locality; it's a binary search, which otherwise has the same performance characteristics as std::map.
  • It uses less memory because it isn't a linked binary tree.
  • It never shrinks unless you forcibly tell it to (since that triggers reallocation).

The best use is when you know the number of elements in advance (so you can reserve upfront), or when insertion/removal is rare but lookup is frequent. Iterator invalidation makes it a bit cumbersome in some use cases, so they're not interchangeable in terms of program correctness.
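
As a sketch of the "known size in advance" case (assuming Boost.Container's flat_map, which exposes the underlying vector's reserve):

#include <boost/container/flat_map.hpp>

int main()
{
    // Reserving upfront avoids reallocation (and the iterator invalidation
    // that comes with it) while the map is being filled.
    boost::container::flat_map<int, double> m;
    m.reserve(1000);
    for (int i = 0; i < 1000; ++i)
        m.emplace(i, i * 0.5);   // keys arrive in order, so each insert lands at the end
    return 0;
}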

Ylisar
  • false :) measurements above show map being faster than flat_map for find operations, I guess boost ppl need to fix the implementation, but in theory you are right. – NoSenseEtAl Dec 19 '14 at 23:20