How optimally efficient are the STL/standard C++ containers?

Question

I was reading this post on SO:

https://stackoverflow.com/a/3183607/997112

which is an answer to a performance question between C++ and C#. The poster is from a High Frequency trading background and says he wrote his own library of classes for HF work due to the search for nanosecond savings. Within his post he mentions that he uses very little of the C++ STL- which surprised me.

My question is- is the C++ STL completely optimized with regards to performance, or has it only been optimized for the average user? Would wrapping a few functions around a native array in C be faster than say Vector or List? Are there any containers within boost which have better performance?

I appreciate these classes will be fast enough for 99% of users- but my question is aimed towards usage within the other 1%.

I doubt whether anyone can say that any piece of software is "optimally efficient". There are usually data-specific optimizations that might suit the 1%, and then of course there is assembly language and even processor-specific optimizations. C++ is not about chasing after every clock cycle. — President James K. Polk, Nov 28 '12 at 00:07
Measure it. One big issue is that the problem and possible optimizations that one guy that falls into the 1% category needs can be totally different from what another guy in the 1% needs, and one fully optimized piece of code for one problem might run into someone else worst case scenario 100% of the time. "fully optimized for all imaginable use cases" does not exist. — nos, Nov 28 '12 at 00:08
To add to that, most of the optimizations for a particular 1% usage case scenario make the typical usage case perform considerably worse -- if for no other reason than the instruction cache thrashing due to larger code getting run for non-performance-critical tasks and slowing down other code that's actually performance-critical. — R.. GitHub STOP HELPING ICE, Nov 28 '12 at 00:18

score 6 · Answer 1 · answered Nov 28 '12 at 00:19

This question won't get any answer stating that "STL is optimal" for a number of reasons:

The are at least half a dozen implementations of the STL out there and there are certainly some which are not optimized while other are optimized (but see the next item).
Most optimizations you do make assumptions about usage patterns and will be pessimisations for some usage patterns. To really optimize for a particular use case, it requires knowledge of how something is used.
Most STL users misuse the components in one form or another and some proper uses only became really supported with C++ 2011 (e.g., using local custom allocators).
When people talk about the STL they typically refer to the example containers shipping with algorithms. Although useful, the focus of STL are the algorithms and these have a much better chance of being optimized. After all, the whole point of STL is that you can readily use the algorithms with your custom created container! Why bother optimizing containers? However, note that items 1 to 3 also applies to the algorithms.
STL has a good abstraction but isn't quite "there". In particular, when combining multiple algorithms better approaches exist than applying algorithms individually but the STL interface doesn't really support this.

All that said, it takes quite a bit of effort to come up with something better than the typical optimized STL implementation. I'd question assertions from anybody who wrote, say, a replace for std::vector<T, A> in a day and claims to be faster for typical use cases of std::vector<T, A> (referring to the C++ 2011 version where the allocator can be sensibly used).

score 2 · Answer 2 · answered Nov 28 '12 at 00:24

C++03 had a pretty glaring performance issue with copying, but that's fixed in C++11 with move semantics and perfect forwarding.

The spec defines the complexity of all the standard algorithms, so you can expect specific performance behavior on all implementations.

Often when people complain about stdlib performance, they're using the wrong tool for the job and just need to improve their algorithm. C++ provides quite a lot of stuff, but it never pretends that one thing works for everything.

Two things I've found stdlib collections lacking for:

When you want to make collections allocate from a pool or arena. Allocators (even stateful) just aren't flexible enough to do this with zero overhead in all cases.

Also, sometimes the most efficient way to implement something is to merge two collections. An example is for a cache, you might want to merge an unordered_map and a list for lookup and LRU respectively. The standard collections don't make this very efficient.

In both cases, Boost Intrusive provides the needed functionality.

score 2 · Answer 3 · answered Nov 28 '12 at 00:34

From actually reading the post, this bit stands out:

"It is ALL custom synchronization classes that only synchronize when empty, custom allocators, custom lock free queues and lists, occasional STL (with custom allocators) but more often custom intrusive collections (of which I have a significant library)."

The STL is set up for dealing with multiple threads that read, but of course requires locks and other synchronization primitives to deal with multiple writers. If the author truly has written custom lock free queues and lists, it is likely to give a fairly hefty performance boost over pure STL with locks. Custom allocators are likewise almost expected, the default std::allocator<T> is known to be a 'middle of the road' type solution. Depending on allocation patterns, utilizing an allocator with a memory pool could offer substantial speedup.

How optimally efficient are the STL/standard C++ containers?

3 Answers3