I'm trying to wrap my head around lock-free reference counting (LFRC) algorithms, but I keep finding examples that I assume work without being able to understand why. I'm focused on copying and deletion. There are three in particular that all follow the same general pattern (in pseudo-code):
Copy {
    shared = source->shared;
    if (shared != null) {
        Atomically_Increment_Counter(shared);
    }
}

Delete {
    if (shared != null) {
        oldcounter = Atomically_Decrement_Counter(shared);
        if (oldcounter == value signifying all references released) {
            Deallocate(shared);
        }
    }
}
The first is from the 2004 paper Lock-Free Reference Counting by David L. Detlefs, figure 2, page 8 (edited for formatting):
void LFRCCopy(SNode **v, SNode *w) {
    SNode *oldv = *v;
    if (w != null)
        add_to_rc(w, 1);
    *v = w;
    LFRCDestroy(oldv);
}

void LFRCDestroy(SNode *v) {
    if (v != null && add_to_rc(v, -1) == 1) {
        LFRCDestroy(v->L);
        LFRCDestroy(v->R);
        delete v;
    }
}
Where add_to_rc is an atomic fetch-and-add that returns the previous value. The second is from the Mat implementation in OpenCV 4 (edited for brevity):
// Line 405
Mat::Mat(const Mat& m)
    : flags(m.flags), dims(m.dims), rows(m.rows), cols(m.cols), data(m.data),
      datastart(m.datastart), dataend(m.dataend), datalimit(m.datalimit), allocator(m.allocator),
      u(m.u), size(&rows), step(0)
{
    if( u )
        CV_XADD(&u->refcount, 1);
    ...
}

// Line 551
void Mat::release()
{
    if( u && CV_XADD(&u->refcount, -1) == 1 )
        deallocate();
    u = NULL;
    ...
}
Where CV_XADD is an atomic fetch-and-add that returns the previous value. And the third is from the std::shared_ptr implementation that ships with libstdc++ in the most recent version of GCC (10.2); this implementation is very complex and I've edited it down a lot for brevity:
class _Sp_counted_base {
    void _M_add_ref_copy()
    { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }

    // Line 161 for full implementation
    void _M_release() noexcept
    {
        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
        {
            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
            _M_dispose();
            if (_Mutex_base<_Lp>::_S_need_barriers)
                __atomic_thread_fence (__ATOMIC_ACQ_REL);
            // editor's note: weak pointer handling removed for brevity
            _M_destroy();
        }
    }
};

class __shared_count {
    __shared_count(const __shared_count& __r) noexcept
    : _M_pi(__r._M_pi)
    {
        if (_M_pi != nullptr)
            _M_pi->_M_add_ref_copy();
    }

    ~__shared_count() noexcept
    {
        if (_M_pi != nullptr)
            _M_pi->_M_release();
    }

    _Sp_counted_base* _M_pi;
};

class __shared_ptr {
    __shared_ptr(const __shared_ptr&) noexcept = default;
    ~__shared_ptr() = default;

    element_type* _M_ptr;        // Contained pointer.
    __shared_count _M_refcount;  // Reference counter.
};
Some explanation of that last one, since it's kind of indirect:

- Copy: __shared_ptr's default copy constructor copies _M_refcount, whose copy constructor shares the same _Sp_counted_base as the original and atomically increments the reference counter.
- Delete: __shared_ptr's default destructor destroys _M_refcount, whose destructor calls _M_release on the _Sp_counted_base, which atomically decrements and possibly deallocates the reference counter.
Anyways, all this boils down to one question, which is the crux of my failure to understand why these work (even that old Dr. Dobbs article and the related questions here on SO [I feel like the answer might be here but I can't see it] raised more questions than answers for me):

In the copy operation, what prevents a race condition where another thread performs the delete operation on the last reference (possibly via another view of the shared object) and deallocates the object between the start of the copy operation and the atomic counter increment? From what I (mis)understand, that would cause copy to increment a deallocated counter and crash or dump core everywhere, or something.
That is, referring to the OpenCV implementation above (since examining it is actually what started this whole mini research project of mine):
Mat::Mat(const Mat& m)
    : flags(m.flags), dims(m.dims), rows(m.rows), cols(m.cols), data(m.data),
      datastart(m.datastart), dataend(m.dataend), datalimit(m.datalimit), allocator(m.allocator),
      u(m.u), // <--- START OF DANGER ZONE
      size(&rows), step(0)
{
    if( u )
        // <--- ALMOST AT END OF DANGER ZONE
        CV_XADD(&u->refcount, 1);
    ...
}

void Mat::release()
{
    if( u && CV_XADD(&u->refcount, -1) == 1 )
        deallocate();
    u = NULL;
    ...
}
What keeps another thread from releasing the last reference via m in the marked "danger zone" of the copy constructor, leaving u non-null but pointing to post-deallocation garbage?
What am I missing here? I feel like it might be obvious and maybe I've been awake too long. In particular, the fact that the libstdc++ implementation also seems to follow this pattern gives it a ton of street cred; I just can't understand the details.