
As a first hands-on adventure into lock-free code (I've only read about it up until now), I'd like to try and build a lock-free reference-counting wrapper for IDisposable classes.

Here is the actual lock-free nested class:

private sealed class Wrapper
{
    public T WrappedObject { get; private set; }
    private int refCount;

    public Wrapper(T objectToWrap)
    {
        WrappedObject = objectToWrap;
        refCount = 1;
    }

    public void RegisterShare()
    {
        Interlocked.Increment(ref refCount);
    }
    public bool TryRegisterShare()
    {
        int prevValue;
        do {
            prevValue = refCount;
            if (prevValue == 0)
                return false;
        } while (prevValue != Interlocked.CompareExchange(ref refCount, prevValue + 1, prevValue));
        return true;
    }
    public void UnregisterShare()
    {
        if (Interlocked.Decrement(ref refCount) <= 0)
            WrappedObject.Dispose();
    }
}

It's a private nested class so I can ensure the methods only get called for the following purposes (a minimal usage sketch follows the list):

  • constructor for creating the first strong reference
  • RegisterShare for duplicating strong references
  • UnregisterShare for releasing strong references
  • TryRegisterShare for promoting weak references into strong ones
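
To make the intended protocol concrete, here's a stripped-down sketch of the outer class driving these methods (the Share method name here is illustrative; the real outer class is in the link below):

public sealed class DisposableShare<T> : IDisposable where T : IDisposable
{
    // ... the private nested Wrapper class from above goes here ...
    private readonly Wrapper wrapper;

    public DisposableShare(T objectToWrap)
    {
        wrapper = new Wrapper(objectToWrap);   // constructor: first strong reference
    }

    private DisposableShare(Wrapper shared)
    {
        wrapper = shared;
    }

    public DisposableShare<T> Share()
    {
        wrapper.RegisterShare();               // duplicate a strong reference
        return new DisposableShare<T>(wrapper);
    }

    public void Dispose()
    {
        wrapper.UnregisterShare();             // release this strong reference
    }
}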

I believe I've got the basic idea; I'm just not sure whether this is truly thread-safe. A question that crosses my mind: in TryRegisterShare, is the first assignment to prevValue guaranteed to be larger than zero unless all strong references were released? Do I need some fencing or volatiles?

I don't believe the outer class that handles sharing of the references is important for this, but you can find it here if anyone's interested: https://codepaste.net/zs7nbh


Updates:

Here's the modified code taking into account what @PeterCordes had to say.

private sealed class Wrapper
{
    public T WrappedObject { get; private set; }
    private int refCount;

    public Wrapper(T objectToWrap)
    {
        WrappedObject = objectToWrap;
        refCount = 1;
    }

    public void RegisterShare()
    {
        Interlocked.Increment(ref refCount);
    }
    public bool TryRegisterShare()
    {
        // Succeeds iff the old value was >= 0, i.e. iff no releasing thread
        // has yet marked the object dead by swapping in int.MinValue.
        return Interlocked.Increment(ref refCount) > 0;
    }
    public void UnregisterShare()
    {
        // Dispose only if we took the count to exactly zero AND won the race
        // to mark the object dead (0 -> int.MinValue) before a concurrent
        // TryRegisterShare resurrected it.
        if (Interlocked.Decrement(ref refCount) == 0
            && Interlocked.CompareExchange(ref refCount, int.MinValue, 0) == 0)
        {
            WrappedObject.Dispose();
        }
    }
}

— relatively_random

1 Answer

Caveat: I don't know C#, but I do know C++ with std::atomic and atomic stuff in x86 asm, and your code (and the function/method names) seem pretty readable/clear so I think I understand what's going on.

You're implementing something similar to C++11's std::shared_ptr, so you could look at implementations of it for inspiration. (This class is like the back-end control block that separate shared_ptr objects sharing the same pointer refer to.)


If two threads both run UnregisterShare and bring the ref count down below zero, they'll both try to .Dispose(). That's a bug, similar to a double-free or double-unlock. You could check for it, or paper over it by changing the code to == instead of <= so that only one thread runs .Dispose(). <= looks like the worst of both worlds: misbehaviour that could be hard to identify.
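
A minimal version of that fix, with a debug check for the double-release case (just a sketch, assuming System.Diagnostics.Debug for the check):

public void UnregisterShare()
{
    int newCount = Interlocked.Decrement(ref refCount);
    Debug.Assert(newCount >= 0, "refcount went negative: double-release bug");
    if (newCount == 0)    // ==, so at most one thread runs Dispose
        WrappedObject.Dispose();
}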


in TryRegisterShare, is the first assignment to prevValue guaranteed to be larger than zero unless all strong references were released?

Barring bugs like failure to call RegisterShare when taking a reference, or double-release bugs, yes, I think so. It would be prudent to use if (prevValue <= 0) return false; there, to make sure you bail out in double-release situations where something has left the refcount negative.
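
With that change, the loop from the question becomes (only the comparison changes):

public bool TryRegisterShare()
{
    int prevValue;
    do {
        prevValue = refCount;
        if (prevValue <= 0)    // bail out on zero and on negative (double-released) counts
            return false;
    } while (prevValue != Interlocked.CompareExchange(ref refCount, prevValue + 1, prevValue));
    return true;
}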

A cmpxchg loop doesn't seem ideal, but if you unconditionally increment and just check afterwards whether you must have started at zero, that could fool other threads, e.g. with this sequence of events:

  • thread A->Unshare
  • thread B->TryShare (failure detected, but leaves refcount=1 temporarily)
  • thread C->TryShare succeeds
  • thread A->Dispose

I haven't looked at how (the Linux / gcc implementation of) C++11 weak_ptr.lock() implements promotion to a shared_ptr (and I'm curious now!).

I wonder if they use cmpxchg in a loop (in the asm) or if they do some kind of dance that avoids it. e.g. maybe Unshare, upon detecting refcount==0, could use a cmpxchg loop to move the refcount from zero to -2^31 before calling Dispose. (And if it sees refcount >= 1 in that loop, it stops trying to kill the object.) Then I think TryShare could succeed as long as it saw Interlocked.Increment(ref refCount) >= 1, since that means that any running UnShare's cmpxchg hasn't already succeeded, and won't succeed.
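
In C#, that dance might look something like this (untested sketch). Note that the cmpxchg "loop" collapses to a single attempt here, I think: the only way the count can leave zero is a resurrecting TryRegisterShare, and in that case the new owner's eventual release will redo the kill attempt.

public bool TryRegisterShare()
{
    // Unconditional increment: the new value is >= 1 iff the old value was >= 0,
    // i.e. iff no releasing thread has managed to mark the object dead yet.
    return Interlocked.Increment(ref refCount) >= 1;
}

public void UnregisterShare()
{
    if (Interlocked.Decrement(ref refCount) == 0
        && Interlocked.CompareExchange(ref refCount, int.MinValue, 0) == 0)
    {
        // We took the count to zero AND won the race to mark the object dead
        // (0 -> int.MinValue) before any TryRegisterShare resurrected it.
        WrappedObject.Dispose();
    }
}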

For some use-cases, it might not be desirable for TryShare to succeed in that case. You could just have it call Unshare to decrement the count again if its increment found that the old value was zero.
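
A sketch of that variant, reusing the UnregisterShare above (note that the undo path can itself end up performing the Dispose, if our transient increment made the original releaser's kill attempt fail):

public bool TryRegisterShare()
{
    if (Interlocked.Increment(ref refCount) > 1)
        return true;       // old value was >= 1: a live strong reference existed

    // Old value was <= 0: the object was dying or dead. Undo our increment.
    UnregisterShare();
    return false;
}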

So I think there needs to be a cmpxchg loop somewhere, but if my logic is correct, you can get away with putting it in the refcount-drops-to-zero path of UnShare instead of in the presumably-hotter TryRegisterShare.


RegisterShare looks safe if you can absolutely guarantee that the refcount wasn't zero beforehand. This should be the case in a thread which already has a ref. For bug-checking, you could check the return value of the increment to catch cases where you accidentally reanimate a dead object (i.e. one that another thread is about to Dispose, or may already have Disposed).
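
For example (again assuming System.Diagnostics.Debug):

public void RegisterShare()
{
    int newCount = Interlocked.Increment(ref refCount);
    // The caller must already hold a strong reference, so the old value
    // must have been >= 1; anything else means we just reanimated a corpse.
    Debug.Assert(newCount > 1, "RegisterShare on a dead or dying object");
}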

In C++, if multiple threads were reading and writing the same shared_ptr, this assumption could be violated. It would require a std::atomic<std::shared_ptr<int>> foo; for that to be safe, though, and that won't compile because shared_ptr is not trivially copyable.

So C++ protects itself from unlocked concurrent access to the reference wrapper objects (by declaring it Undefined Behaviour); you should, too. Otherwise another thread with a reference to the same shared_ptr object this thread is copying might have called .reset() on it, and so might decrement the refcount in the control block after this thread reads the pointer to the control block, but right before this thread increments the control-block refcount.


I just noticed that the title question is about needing extra fences or volatiles:

The decrement in UnShare needs to become globally visible before anything that .Dispose() does, and it needs to be a release operation to make sure loads/stores into a shared object become globally visible before the decrement of the refcount.

Your code already achieves that, because Interlocked.anything has sequential-consistency semantics in .NET on any architecture (not just on Windows), so nothing can reorder across it in either direction.
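
If it didn't, the fences would go around the decrement, something like this (purely illustrative; both barriers are redundant in real .NET, because Interlocked.Decrement is already a full barrier):

public void UnregisterShare()
{
    Thread.MemoryBarrier();   // "release": this thread's stores to the shared
                              // object become visible before the decrement does
    if (Interlocked.Decrement(ref refCount) == 0)
    {
        Thread.MemoryBarrier();   // "acquire": Dispose's accesses can't start
                                  // until we've seen the other threads' releases
        WrappedObject.Dispose();
    }
}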

AlexRP's blog post about the .NET memory model explains lots of detail about this and related things (Interlocked, Thread.VolatileRead/Write, Volatile.Read/Write, and the fact that simple loads and stores on types up to the width of IntPtr.Size are automatically atomic, but not synchronized). Fun fact: in Microsoft's implementation for x86, Thread.VolatileRead and VolatileWrite are surrounded by MFENCE instructions on either side, but the language standard only requires acquire or release semantics (which x86 gives you for free, with no barrier). Mono doesn't use barriers for these functions, but Microsoft can't remove them until everyone's buggy code stops relying on that implementation detail.


Other than the decrement, I think everything else just needs atomicity for your ref-counting, not synchronization of any other operations. The ref-counting doesn't synchronize access to the shared object from multiple threads, so a TryShare/UnShare pair doesn't form a critical section (which would need acquire/release semantics). From the constructor to the final UnShare that results in a .Dispose(), this class is hands-off as far as synchronization goes. In C++, I think you could use memory_order_relaxed for the RegisterShare increment and the TryShare cmpxchg; on some weakly-ordered architectures, like ARM, that would require fewer barriers. (That's not an option in C#: you only have the sequential-consistency Interlocked operations, which require full barriers on ARM. There's no extra cost on x86, though, since locked read-modify-write instructions on x86 are already full memory barriers, like MFENCE / Thread.MemoryBarrier.)

— Peter Cordes

  • Yup, I was inspired by the idea of a shared_ptr for this. But I didn't look at the implementations, you got me on that. Count can't go below zero because Unregister only gets called once per construction or a successful register. I will change the end-conditions just in case, though. The cmpxchg loop was originally just an increment, yes, but then I noticed the problematic sequence you mentioned. Count is never zero when RegisterShare is called because it only gets called when duplicating an already existing strong reference. I'll go read through your points more deeply now. :) – relatively_random Oct 27 '16 at 12:26
  • @relatively_random: I just added another couple paragraphs about that last case of duplicating strong refs, and how people misusing your class could break it. I was thinking at first that C++ did have to worry about this, because passing around references to a `shared_ptr` instead of copies by value is common and much more efficient. But then I realized that's only allowed within one thread, because it's only the ref-counting that's atomic, not the shared_ptr object itself. – Peter Cordes Oct 27 '16 at 12:42
  • The `Wrapper` is a private nested class within a `DisposableShare` class (as can be seen in the link I provided). Usage notes for the publicly available `DisposableShare` say that it's not thread-safe and that every thread should have a separate "clone". This seems in line with what you say about C++'s `shared_ptr`. – relatively_random Oct 27 '16 at 12:56
  • I'm thinking about your suggestion about cmpxchg. Doesn't it just move the loop from TryRegister to Unregister? If I understand it correctly, the question is only which I want to be faster: disposals or promotions, right? Disposal does only ever happen once per managed object, so that seems like a sensible trade. – relatively_random Oct 27 '16 at 13:13
  • @relatively_random: Explicitly declaring your wrapper non-thread-safe sounds good. That should cover it. With those limitations, I don't see any problem with your `RegisterShare`. (But that's not a guarantee that there aren't any! I didn't look super-hard). BTW, I just added an update about fencing / memory-ordering. (Mostly you don't need any for ref-counting, but .Dispose needs to be ordered). – Peter Cordes Oct 27 '16 at 13:19
  • @relatively_random: re: cmpxchg. Yes, that's exactly right. I tried to say all that in the summary paragraph at the end of that section, so I'm glad you understood. :) – Peter Cordes Oct 27 '16 at 13:21
  • Updated the post to incorporate your suggestions. Not sure if this use of CompareExchange without a loop is right, though. If it does guarantee an exchange when 0 is still present, I believe it should work since the only thing that would cause the count to move from zero is an upcount. In that case, the object that increased the count will drop it and see a zero again. – relatively_random Oct 27 '16 at 13:37
  • According to https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122.aspx, the underlying Windows Interlocked functions provide full barriers for all three operations I'm using. Unfortunately, .Net's Interlocked class provides no explicit guarantees like that. If I were to put manual memory barriers just to be sure, where would I do that? And would the compiler be able to optimize two sequential barriers to a single one in case .Net already provides them implicitly? – relatively_random Oct 27 '16 at 13:47
  • @relatively_random: A StoreStore barrier is a no-op on x86, so a manual StoreStore barrier should compile to zero instructions. (SFENCE is only needed when using NT stores.) A manual full barrier there will probably compile to an MFENCE separate from the `lock dec`. (Or `lock xadd`, if the compiler doesn't realize that it can just test the flags set by `lock dec` instead of using a fetch_add.) I don't use Windows, C#, or .Net at all, so I'm not the person to answer that question any further :/ – Peter Cordes Oct 27 '16 at 13:59
  • Unfortunately, .Net only provides explicit calls to full memory barriers so it's not a no-op. I'll think some more about adding it in. Thanks for your help! – relatively_random Oct 27 '16 at 14:05
  • @relatively: This was bugging me so I googled it up: "All operations provided on the `Interlocked` class guarantee sequential consistency and are atomic everywhere.", according to a (long and detailed) [blog post by alexrp](http://blog.alexrp.com/2014/03/30/dot-net-atomics-and-memory-model-semantics/) about the .Net memory model. alexrp is a Mono developer, so he actually talks about the difference between what Microsoft's implementation actually does vs. what the .Net language standard requires: e.g. `Thread.VolatileWrite` is surrounded by two MFENCE on MS, and buggy code depends on it – Peter Cordes Oct 29 '16 at 03:17
  • @relatively_random: updated my answer with a new final section. I'm still pretty sure you're totally fine WRT synchronization, even on ARM. – Peter Cordes Oct 29 '16 at 03:57