
For simplicity, the release() in libc++'s shared_ptr implementation can be depicted as:

void release()
{
    if (ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
    {
        delete this;
    }
}

Why doesn't libc++ split it into a release decrement and an acquire fence?

void release()
{
    if (ref_count.fetch_sub(1, std::memory_order_release) == 1)
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete this;
    }
}

as Boost recommends? The split version looks superior because it imposes acquire memory ordering only on the last decrement, not on every one.

Andriy Tylychko
  • Related: https://stackoverflow.com/questions/72606895/does-it-make-sense-to-use-a-relaxed-load-followed-by-a-conditional-fence-if-i-d. Comments suggest that in some cases, relaxed load + acquire fence can be more expensive than acquire load. So it may depend on which is considered to be the more common case. I do imagine there is a lot of code that uses `shared_ptr` but never actually shares it. – Nate Eldredge Jun 17 '22 at 14:45
  • thanks @NateEldredge, this may be a reason. I'm more interested if it's correct to use rel decrement + acq fence for shared ptr's reference controller. I'm working with a shared ptr implementation that uses this and TSAN reports a race between T1 `fetch_sub` and T2 destruction of the reference controller – Andriy Tylychko Jun 17 '22 at 14:52
  • The link is to an archived old version of libc++. The current one is at https://github.com/llvm/llvm-project/blob/main/libcxx/include/__memory/shared_ptr.h#L121, but nothing changed about the approach. – user17732522 Jun 17 '22 at 15:10
  • It's possibly related to `std::atomic_thread_fence` being a more restrictive fence than one tied to an operation. – Drew Dormann Jun 17 '22 at 15:19
  • @DrewDormann theoretically or is there any platform where it’s faster than two acq loads? Two comes from the assumption that usually there’re at least two refs – Andriy Tylychko Jun 17 '22 at 16:39
  • @NateEldredge *'I do imagine there is a lot of code that uses shared_ptr but never actually shares it.'* – well, I really hope this didn't have any influence on the decision!!! Using some tool wrong, even if the majority does, shouldn't *ever* punish those using it correctly... – Aconcagua Jun 17 '22 at 16:40
  • @Aconcagua: That's true, but in a program where *some* `shared_ptr` objects are actually shared, but many others aren't, the non-shared case is still the common one. But the code-gen would still have to be correct for actually-shared objects, so you can't just avoid the work entirely. – Peter Cordes Jun 18 '22 at 03:43
  • Both versions are certainly correct, and you could write out a proof if you are concerned. TSAN doesn't properly recognize the effect of fences, see https://stackoverflow.com/questions/70542993/why-does-the-thread-sanitizer-complain-about-acquire-release-thread-fences, so it's not surprising that it would report a false positive data race. – Nate Eldredge Jun 18 '22 at 13:22
  • But if one was going to choose based on performance, it would have to be a heuristic. On a machine where an acquire fence is slower than an acquire load, one would have to trade it off against the relative frequency of the two cases, and I think it's very plausible that the unconditional acquire load could be faster on average than the conditional acquire fence. And of course, on other machines it makes no difference at all. On x86 every atomic RMW is automatically `seq_cst` and `std::atomic_thread_fence(std::memory_order_acquire)` is a no-op. – Nate Eldredge Jun 18 '22 at 13:32

0 Answers