10

std::shared_ptr is guaranteed to be thread-safe, in the sense that its reference counting stays correct even when copies of the same shared_ptr are made and destroyed from multiple threads. I don't know what mechanism typical implementations use to ensure this, but surely it must have some overhead, and that overhead would be present even when your application is single-threaded.

Is the above the case? And if so, does that mean it violates the principle of "you don't pay for what you don't use" if you aren't using the thread-safety guarantees?
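
For concreteness, here is the kind of thing I understand the guarantee to cover (my own minimal example): two threads making and destroying copies of the same shared_ptr, so the reference count is modified concurrently.

```cpp
#include <memory>
#include <thread>

int main() {
    auto p = std::make_shared<int>(42);

    // Each thread makes and destroys its own copies of p. The reference
    // count in the shared control block is updated from both threads
    // concurrently, which shared_ptr guarantees is safe.
    auto worker = [p] {
        for (int i = 0; i < 100000; ++i) {
            std::shared_ptr<int> local = p;  // count goes up...
        }                                    // ...and back down
    };

    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
}
```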

tjhance

2 Answers

12

If we check the cppreference page for std::shared_ptr, it states the following in the Implementation notes section:

To satisfy thread safety requirements, the reference counters are typically incremented and decremented using std::atomic::fetch_add with std::memory_order_relaxed.
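
To make that concrete, here is a rough sketch of my own (not any library's actual code) of what such a reference-count update might look like; note that the decrement that may trigger destruction usually wants stronger ordering than relaxed:

```cpp
#include <atomic>

// Hypothetical, simplified control block; a real one also tracks a weak
// count, the deleter and the allocator.
struct control_block {
    std::atomic<long> use_count{1};

    void add_ref() {
        // An increment only needs atomicity, not ordering, so relaxed suffices.
        use_count.fetch_add(1, std::memory_order_relaxed);
    }

    bool release() {
        // The decrement that may destroy the object needs stronger ordering;
        // acquire-release is one common choice. Returns true for the last owner.
        return use_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```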

It is interesting to look at an actual implementation; for example, the libstdc++ implementation documentation says:

For the version of shared_ptr in libstdc++ the compiler and library are fixed, which makes things much simpler: we have an atomic CAS or we don't, see Lock Policy below for details.

The Selecting Lock Policy section says (emphasis mine):

There is a single _Sp_counted_base class, which is a template parameterized on the enum __gnu_cxx::_Lock_policy. The entire family of classes is parameterized on the lock policy, right up to __shared_ptr, __weak_ptr and __enable_shared_from_this. The actual std::shared_ptr class inherits from __shared_ptr with the lock policy parameter selected automatically based on the thread model and platform that libstdc++ is configured for, so that the best available template specialization will be used. This design is necessary because it would not be conforming for shared_ptr to have an extra template parameter, even if it had a default value. The available policies are:

[...]

3. _S_Single

This policy uses a non-reentrant add_ref_lock() with no locking. It is used when libstdc++ is built without --enable-threads.
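
That policy parameterization boils down to something like the following sketch (my own illustration with invented names, not the real libstdc++ classes):

```cpp
#include <atomic>

// Hypothetical lock policies mirroring the idea described above.
enum lock_policy { single_threaded, atomic_ops };

template <lock_policy P> struct counted_base;

// Single-threaded builds: a plain integer and ordinary increments.
template <> struct counted_base<single_threaded> {
    long use_count = 1;
    void add_ref() { ++use_count; }
};

// Multi-threaded builds: an atomic counter.
template <> struct counted_base<atomic_ops> {
    std::atomic<long> use_count{1};
    void add_ref() { use_count.fetch_add(1, std::memory_order_relaxed); }
};

// The internal pointer classes would take the policy as a template parameter,
// and the public shared_ptr would pick one when the library is configured.
```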

The same document further says (emphasis mine):

For all three policies, reference count increments and decrements are done via the functions in ext/atomicity.h, which detect if the program is multi-threaded. If only one thread of execution exists in the program then less expensive non-atomic operations are used.
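
The dispatch it describes amounts to roughly the following sketch; `program_is_multithreaded()` is a stand-in for the real check, which libstdc++ does through its threading layer, and `__atomic_fetch_add` is the GCC/Clang builtin:

```cpp
// Stand-in for the real check; libstdc++ asks its threading layer whether any
// thread has actually been started. Here it is just a placeholder.
inline bool program_is_multithreaded() { return false; }

// Hypothetical helper in the spirit of ext/atomicity.h: a plain increment when
// only one thread exists, an atomic read-modify-write otherwise.
inline long add_ref_dispatch(long* count) {
    if (!program_is_multithreaded()) {
        return (*count)++;                                   // non-atomic path
    }
    return __atomic_fetch_add(count, 1, __ATOMIC_RELAXED);   // atomic path
}
```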

So at least in this implementation you don't pay for what you don't use.

Shafik Yaghmour
  • "memory_order_relaxed Relaxed operation: there are no synchronization or ordering constraints, only atomicity is required of this operation." That does *not* sound right! – curiousguy Aug 22 '15 at 05:24
4

At least in the Boost code on i386, boost::shared_ptr was implemented using an atomic CAS operation. This means that while it has some overhead, that overhead is quite low. I'd expect any implementation of std::shared_ptr to be similar.
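
A CAS-based increment is roughly this shape (my sketch, not Boost's actual code):

```cpp
#include <atomic>

// Increment a reference count with compare-and-swap instead of fetch_add:
// keep retrying until we manage to swap in old + 1.
inline void add_ref_cas(std::atomic<long>& count) {
    long old = count.load(std::memory_order_relaxed);
    while (!count.compare_exchange_weak(old, old + 1,
                                        std::memory_order_relaxed)) {
        // On failure, 'old' has been reloaded with the current value; retry.
    }
}
```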

In tight loops in high-performance numerical code I found some speed-ups by switching to raw pointers and being really careful. But for normal code I wouldn't worry about it.
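
For example (my own illustration of that trick), keep the shared_ptr alive outside the loop and work through a reference or raw pointer inside it, so the hot path never touches the reference count:

```cpp
#include <memory>
#include <vector>

double sum(const std::vector<std::shared_ptr<double>>& values) {
    double total = 0.0;
    for (const auto& sp : values) {   // reference, not a copy: no count traffic
        total += *sp;
    }
    return total;
}

void scale(const std::shared_ptr<std::vector<double>>& data, double k) {
    std::vector<double>& v = *data;   // bind once, outside the loop
    for (double& x : v) {
        x *= k;                       // the hot loop never touches the count
    }
}
```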

Michael Anderson
  • I'm pretty sure all major implementations utilize atomics. I know Microsoft's does, for one. – Violet Giraffe Sep 11 '14 at 06:49
  • @VioletGiraffe: Not all hardware platforms support lock-free atomic counters. Older ARMs don't, for example. – Mike Seymour Sep 11 '14 at 06:56
  • Atomic operations are pretty expensive. Last time I measured, I believe it was typically about 20 ns per atomic operation on a typical Intel CPU. On the order of 100 times slower than a regular operation. – Reto Koradi Sep 11 '14 at 07:25
  • "Pretty expensive" is a relative thing. Any I/O or system calls will typically swamp it, so it really only matters in tight loops. However, on older systems where it falls back to mutexes, you're talking another order of magnitude slower (IIRC Boost used to do some pretty ugly things to try to keep that number down). On those systems the implementation of `shared_ptr` typically depended on whether threading flags were enabled, which then caused me all kinds of pain tracking down bugs from pieces compiled with different flags (as the internal layout of `shared_ptr` changed). – Michael Anderson Sep 11 '14 at 07:31
  • @MikeSeymour "_Not all hardware platforms support lock-free atomic counters_" Can they still implement the various mutexes and other POSIX threading primitives? – curiousguy Aug 22 '15 at 05:25