
I've been looking into implementations of atomic reference counting.

Most of the operations are very consistent between libraries, but I've found a surprising variety in the "decrease refcount" operation. (Note that, generally, the only difference between shared and weak decref is which on_zero() is called. Exceptions are noted below.)

If there are other implementations expressed in terms of the C11/C++11 memory model (what does MSVC do?), other than the "we use seq_cst because we don't know any better" kind, feel free to edit them in.

Most of the examples were originally C++, but here I've rewritten them in C, inlined them, and normalized them to the >= 1 convention (a live object has a count of at least 1, so the final decrement observes 1):

#include <stdatomic.h>
#include <stddef.h>
typedef struct RefPtr RefPtr;
struct RefPtr {
    _Atomic(size_t) refcount;
};
// calls the destructor and/or calls free
// on a shared_ptr, this also calls decref on the implicit weak_ptr
void on_zero(RefPtr *);
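
For comparison, the increment side is uncontroversial; this sketch follows the relaxed increment shown in the Boost documentation (the `incref` name is mine):

void incref(RefPtr *p) {
    // no ordering needed: the new reference is handed to its holder by
    // other means (copying the pointer), not through the counter itself
    atomic_fetch_add_explicit(&p->refcount, 1, memory_order_relaxed);
}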

From the Boost intrusive_ptr documentation examples and OpenSSL:

void decref_boost_intrusive_docs(RefPtr *p) {
    if (atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_release) == 1) {
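        // this acquire fence pairs with the release in the other threads'
        // fetch_subs, so every write those threads made before their
        // decrefs happens-before on_zero()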
        atomic_thread_fence(memory_order_acquire);
        on_zero(p);
    }
}

It would be possible to use memory_order_acq_rel for the fetch_sub operation, but this results in unneeded "acquire" operations when the reference counter does not yet reach zero and may impose a performance penalty.

But most others (Boost, libstdc++, libc++ shared) do something else:

void decref_common(RefPtr *p) {
    if (atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_acq_rel) == 1)
        on_zero(p);
}
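
To see why the acquire half matters at all, here is a minimal two-thread sketch (my own illustration reusing the declarations above; `Obj`, `thread_a`, and `thread_b` are made-up names): whichever thread happens to perform the final decrement must observe the other thread's writes before running on_zero().

typedef struct {
    RefPtr rc;
    int data;
} Obj;

void thread_a(Obj *obj) {
    obj->data = 42;          // plain write to the shared object
    decref_common(&obj->rc); // the release half publishes the write above
}

void thread_b(Obj *obj) {
    // if this turns out to be the final decrement, the acquire half
    // synchronizes with thread_a's release, so on_zero() (and the
    // destructor/free inside it) is guaranteed to see data == 42 and
    // cannot race with the write
    decref_common(&obj->rc);
}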

But libc++ does something different for the weak count. Curiously, this lives out of line in a separate source file:

void decref_libcxx_weak(RefPtr *p) {
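    // if we observe 1 here, we hold the only remaining weak reference, so
    // no other thread can modify the count concurrently; a plain acquire
    // load still orders all earlier accesses before on_zero() while
    // avoiding the more expensive read-modify-write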
    if (atomic_load_explicit(&p->refcount, memory_order_acquire) == 1)
        on_zero(p);
    else
        decref_common(p);
}

The question, then, is: what are the actual differences?

Sub-questions: Are the comments wrong? What do specific platforms do (on aarch64, would ldar be cheaper than dmb ishld? also ia64?)? Under what conditions can weaker versions be used (e.g. if the dtor is a nop, if the deleter is just free, ...)?

See also "Atomic Reference Counting" and "Why is an acquire barrier needed before deleting the data in an atomically reference counted smart pointer?"

  • Only doing `release` unconditionally and `acquire` conditionally may be a performance benefit, but OTOH, I'd suspect many `shared_ptr`s only ever have a single reference, so a branch predictor would be trained to execute the `acquire` speculatively anyway. If there is an efficient instruction for the `atomic_fetch_sub_explicit(acq_rel)`, the combined operation would then be better. – EOF May 05 '19 at 07:41
  • @EOF So your guess is that most `shared_ptr` are used as heavy, relatively complex, type erased `unique_ptr`? – curiousguy May 08 '19 at 04:25
  • @curiousguy I wouldn't be surprised. And why not? Most of the time the inefficiency of `shared_ptr` isn't going to be the bottleneck in terms of speed, so until it shows up in a profile, who cares? – EOF May 08 '19 at 16:22
  • @EOF Why use an inherently more powerful and flexible tool when a more limited tool expresses your design better, has a stricter invariant, and is hence more powerful and flexible in another way? Unique ownership inherently *allows the release of that ownership*. By returning a `shared_ptr` from a function, you deprive your users of that flexibility; they are stuck with `shared_ptr`: once managed by `shared_ptr`, always managed by `shared_ptr`! – curiousguy May 09 '19 at 02:24
  • @curiousguy I'm not saying that you *shouldn't* use `unique_ptr` where appropriate, I'm saying that it probably *isn't* being used consistently, and that it *probably* doesn't matter most of the time. – EOF May 09 '19 at 15:50
  • I imagine the argument against returning `unique_ptr` is that in a public/general-purpose API you don't know who is going to be calling your function, and therefore you can't guarantee that they won't need the ability to share the returned object across multiple owners -- therefore you return `shared_ptr` just to make sure all use-cases are covered, at the expense of a small amount of performance. (Dunno if that's a *good* argument, but I think it's probably a common one) – Jeremy Friesner Jun 11 '19 at 00:45
  • @JeremyFriesner Of course the library code that creates the object doesn't know whether multiple owners or a single owner will be appropriate for the user, and this is exactly why it should always return a `unique_ptr`. – curiousguy Jun 11 '19 at 02:40
  • Aha— I was not aware that it was possible to convert a unique_ptr to a shared_ptr. :) – Jeremy Friesner Jun 11 '19 at 02:45
  • A `shared_ptr` constructed from a `unique_ptr` requires 2 allocations instead of 1 (if `make_shared` is used), which is often considered worse. BUT, it's not clear that sharing the cache line between the counter and the object is actually a win. – o11c Jun 11 '19 at 03:18

1 Answer

The libc++ choice is documented in the source code:

NOTE: The acquire load here is an optimization of the very common case where a shared pointer is being destructed while having no other contended references.

The libc++ authors observed that, most of the time, when the last shared_ptr is destroyed there is no weak_ptr referencing the shared object. As far as I know, and at least on x86, read-modify-write instructions are much more expensive than plain loads. So, for the most common case, they decided to avoid performing an expensive and useless read-modify-write. Other implementations of the standard library do not perform this optimization.
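
To make the cost claim concrete (my own illustration, not from the answer; exact output varies by compiler and flags), on x86-64 an acquire load is an ordinary `mov`, because every x86 load already has acquire semantics, while the fetch_sub is a locked read-modify-write that needs exclusive ownership of the cache line:

// typical x86-64 codegen, illustrative only (p assumed to be in rdi):
//
//   atomic_load_explicit(&p->refcount, memory_order_acquire)
//       mov rax, [rdi]        ; plain load; acquire is free on x86
//
//   atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_acq_rel)
//       mov rax, -1
//       lock xadd [rdi], rax  ; locked RMW: needs the line exclusive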

  • Hm, but that code applies to *all* weak pointers; if it were inlined, the extra test could be avoided. Plus, comments lie. – o11c May 05 '19 at 23:48
  • @o11c From what I have seen, optimizations are practically disabled when a set of expressions involves atomic operations. – Oliv May 06 '19 at 06:53
  • @curiousguy Indeed, for fun, look at that horror: https://godbolt.org/z/yQ35Fp !!!!! Atomics are treated as volatile atomic. There is a confusion between the concept of atomic load and the concept of side effects. The standard is clear about that. There are still people preaching that a volatile atomic does not make sense. And that preaching has gone on for so long that we can expect this kind of code pessimization to always be there! – Oliv May 08 '19 at 06:07
  • @Oliv Fascinating compilation. Who comes up with these rules? – curiousguy May 08 '19 at 08:11
  • @Oliv if that was an `acquire` load it wouldn't be legal to remove, right? – o11c May 08 '19 at 17:08
  • @o11c For this specific trivial case it would not change anything. But I agree that for more complex cases, memory ordering could influence the optimizations that a smart compiler could perform. – Oliv May 08 '19 at 17:20
  • @curiousguy Out of curiosity, I am going to ask a genuine question on SO, "Why are my unnecessary atomic loads not optimized away?", as a probe. – Oliv May 08 '19 at 17:47
  • @curiousguy For now my question has 52 views, and a comment saying that an atomic load is "asking for an access to memory" has only 3 upvotes. So only 8% somehow think atomic implies volatile. I must admit I am probably wrong about the wrong preaching; it does not have the influence I expected! – Oliv May 08 '19 at 18:27
  • @o11c Why wouldn't throw-away acq loads be removed? – curiousguy May 09 '19 at 02:13
  • @curiousguy because an `acquire` has hardware side-effects: it makes other stores visible – o11c May 09 '19 at 04:53
  • @o11c That's the exact same mistake you noticed previously: **assuming volatile semantics (direct C/C++ mapping to asm)** = a mapping to hardware of an abstract C/C++ functionality specified in terms of abstract semantics, which is not an observable of the program execution. (Each thread has no observable behavior; only whole program executions have.) If you want HW side-effects, use an asm-declaration (BTW why is that even called a declaration? it declares nothing!) or do a syscall (which is also defined in terms of asm). This remark also applies to the use of atomics in Java to create fences – curiousguy May 09 '19 at 05:03
  • **The sole purpose of an acq load is to use the value to determine the progress of the storing thread**: acq ops are information-gathering ops (an acq fence is "anonymous", that is, not linked to a particular object, but applying to all earlier atomic-object accesses). **If you are not going to gather any info while doing an acq, you might as well not do it at all, as you wouldn't know what you are synchronizing with**. Maybe you confused a load acq on an atomic, which is not anonymous and only acquires on that object, with an anonymous fence. In C/C++, only the explicit fences fence anonymously.
  • IOW, for an isolated function (context of use unknown), you can't ever rewrite the code to remove all uses of a given explicit fencing direction: you can't remove all explicit acq fences if there is a single one, etc. So a **separately compiled** function containing a fence will generate specific assembly code. ("Separate" means that no global program optimization is allowed here.) Of course global optimization might remove useless fencing. – curiousguy May 09 '19 at 05:24
  • @curiousguy "_If you are not going to gather any info while doing an acq_" **That is WRONG in general**. It assumes that the particular operation is the only one on the atomic object. There may be a previous **relaxed** operation, that is, any reading operation (load or RMW) that wasn't _at least_ an acquire and whose result wasn't thrown away; the next throw-away acquire (`(void)x.load()`) will give **that previous** load acquire semantics. I'm SORRY! – curiousguy May 24 '19 at 20:27
  • @o11c See update. I was oversimplifying. Still, in `x.load();x.load();x.load();` all acquire operations after the first one do not come after a relaxed load, so they are useless and can be ignored. – curiousguy May 24 '19 at 20:29