
I encountered the following implementation of Singleton's get_instance function:

template<typename T>
T* Singleton<T>::get_instance()
{
    static std::unique_ptr<T> destroyer; // deletes the instance at program exit

    T* temp = s_instance.load(std::memory_order_relaxed); // first check, without the lock
    std::atomic_thread_fence(std::memory_order_acquire);

    if (temp == nullptr)
    {
        std::lock_guard<std::mutex> lock(s_mutex);
        temp = s_instance.load(std::memory_order_relaxed); // re-read current status of s_instance
        if (temp == nullptr)
        {
            temp = new T;

            destroyer.reset(temp);
            std::atomic_thread_fence(std::memory_order_release);
            s_instance.store(temp, std::memory_order_relaxed); // publish the pointer
        }
    }

    return temp;
}
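
For context, assume the enclosing class looks roughly like this (the member declarations are inferred from the function above; the rest is a sketch):

#include <atomic>
#include <memory>
#include <mutex>

template<typename T>
class Singleton
{
public:
    static T* get_instance();

private:
    static std::atomic<T*> s_instance; // the published pointer
    static std::mutex s_mutex;         // guards first-time construction
};

template<typename T>
std::atomic<T*> Singleton<T>::s_instance{nullptr};

template<typename T>
std::mutex Singleton<T>::s_mutex;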

And I was wondering - is there any value in the acquire and release memory barriers here? As far as I know, memory barriers are meant to prevent the reordering of memory operations on two different variables. Let's take the classic example:

(This is all pseudo-code - don't get caught up in the syntax)

# Thread 1
while(f == 0);
print(x)

# Thread 2
x = 42;
f = 1;

In this case, we want to prevent the reordering of the two store operations in Thread 2, and the reordering of the two load operations in Thread 1. So we insert barriers:

# Thread 1
while(f == 0);
acquire_fence
print(x)

# Thread 2
x = 42;
release_fence
f = 1;
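
In real C++, this could be written as (a sketch; x and f as globals):

#include <atomic>
#include <cstdio>

int x = 0;             // plain data, synchronized via f
std::atomic<int> f{0}; // signalling variable

void thread1() // reader
{
    while (f.load(std::memory_order_relaxed) == 0)
        ;
    std::atomic_thread_fence(std::memory_order_acquire);
    std::printf("%d\n", x); // guaranteed to print 42
}

void thread2() // writer
{
    x = 42;
    std::atomic_thread_fence(std::memory_order_release);
    f.store(1, std::memory_order_relaxed);
}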

But in the above code, what is the benefit of the fences?

EDIT

The main difference between those cases, as I see it, is that in the classic example we use memory barriers because we deal with two variables - so there is the "danger" of Thread 2 storing f before storing x, or, in Thread 1, of loading x before loading f.

But in my Singleton code, what is the possible memory reordering that the memory barriers aim to prevent?

NOTE

I know there are other (and maybe better) ways to achieve this; my question is for educational purposes. I'm learning about memory barriers and am curious whether they are useful in this particular case, so please ignore anything not relevant to that question.

YoavKlein
  • The `destroyer` object releases memory, but why not let the `Singleton` destructor handle that? – LWimsey Mar 28 '22 at 23:00
  • Why not just `static T inst; return &inst;`? That’s fully threadsafe, doesn’t malloc, and will be hard to beat performance-wise. – Ben Mar 29 '22 at 02:34
  • Please refer to the edit – YoavKlein Mar 29 '22 at 05:10
  • `temp` is a pointer, and the pointed-to memory is written by a constructor. A release-store of the pointer to `s_instance` ensures that things which load the pointer and deref it will see valid data. `s_instance.store(temp, mo_release)` would be more efficient on some machines and easier to type, so you don't actually need separate fences unless you insist on using `mo_relaxed`. (At least I don't see any reason for using a separate fence.) – Peter Cordes Mar 29 '22 at 06:24
  • StoreStore ordering between the memory `temp` points to, and the visibility of the pointer to other threads via `s_instance.store`, like I said in my last comment. – Peter Cordes Mar 29 '22 at 07:50
  • @PeterCordes You mean the assignment to `temp` may occur **after** the assignment of `temp` into `s_instance`? – YoavKlein Mar 29 '22 at 07:53
  • @PeterCordes - I can't see how this could happen! The order of `temp = new T` and `s_instance.store(temp, mo_relaxed)` must not be reordered, since there's an obvious causal dependency between those operations. If they were reordered, `NULL` would be stored in `s_instance` and never changed! Unlike what happens with `x` and `f`, which will eventually both be visible to other threads... – YoavKlein Mar 29 '22 at 08:10
  • No, `temp` is a private local var. I'm talking about the assignments to the memory it *points to*, done by the constructor run by `new T`. And it's not a matter of "occurring" after (in this thread), it's a matter of *becoming visible* to other threads after. Memory reordering isn't the same thing as out-of-order execution. For stores on real CPUs, it's the order of [commit from the store buffer to L1d cache](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram), nothing to do with actual (out-of-order) execution. – Peter Cordes Mar 29 '22 at 08:13
  • Or to put it another way, you can't explain memory reordering in terms of moving lines around in the source code. – Peter Cordes Mar 29 '22 at 08:14
  • @PeterCordes, I got you now, thanks. So basically this release fence prevents a situation in which the memory address held by `s_instance` becomes visible before the actual object has been initialized by the ctor. But what about the acquire fence above? Does it have any significance? – YoavKlein Mar 29 '22 at 08:23
  • It has to sync-with the release store in the writer that did the constructing, to make sure a reader sees valid stuff in the pointed-to memory when it derefs the return value. – Peter Cordes Mar 29 '22 at 08:40

1 Answer


The complexity of this pattern (known as double-checked locking, or DCLP) is that data synchronization can happen in two different ways (depending on when a reader accesses the singleton), and the two kind of overlap.
But since you're asking about fences, let's skip the mutex part.

But in my Singleton code, what is the possible memory reordering that the memory barriers aim to prevent?

This is not very different from your pseudo-code, where you already noticed that the acquire and release fences are necessary to guarantee the outcome of 42.
f is used as the signalling variable, and accesses to x had better not be reordered with it.

In the DCL pattern, the first thread to arrive allocates and constructs the object: temp = new T;
The memory temp points at is going to be accessed by other threads, so the constructor's writes must be synchronized (i.e. made visible to those threads).
The release fence followed by the relaxed store guarantees that the new operation is ordered before the store, so that other threads observe the same order. Thus, once the pointer is written to the atomic s_instance and another thread reads the address from it, that thread also sees the memory it points at.
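
Annotated, the writer side of your code does this (using your variable names):

temp = new T;                                        // 1. the constructor writes the object's fields
std::atomic_thread_fence(std::memory_order_release); // 2. orders 1 before 3 (StoreStore)
s_instance.store(temp, std::memory_order_relaxed);   // 3. publish the pointer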

The acquire fence does the same thing, but in the opposite direction: it guarantees that everything sequenced after the relaxed load and the fence (i.e. accessing the pointed-to memory) cannot be reordered before the load. This way, allocating and constructing the object in one thread and using it in another cannot overlap.
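
And the reader side mirrors it:

T* temp = s_instance.load(std::memory_order_relaxed); // 1. read the pointer
std::atomic_thread_fence(std::memory_order_acquire);  // 2. orders 1 before 3 (LoadLoad)
// 3. any subsequent dereference of temp now sees the constructor's writes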

In another answer, I tried to visualize this with a diagram.

Note that these fences always come in pairs: a release fence without a matching acquire is meaningless. You can also use (and mix) standalone fences with release/acquire operations:

s_instance.store(temp, std::memory_order_release); // no standalone fence necessary
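
With that, the whole function can be written without standalone fences (a sketch of the same logic):

template<typename T>
T* Singleton<T>::get_instance()
{
    static std::unique_ptr<T> destroyer;

    T* temp = s_instance.load(std::memory_order_acquire); // synchronizes-with the release store below
    if (temp == nullptr)
    {
        std::lock_guard<std::mutex> lock(s_mutex);
        temp = s_instance.load(std::memory_order_relaxed); // the mutex already orders this
        if (temp == nullptr)
        {
            temp = new T;
            destroyer.reset(temp);
            s_instance.store(temp, std::memory_order_release); // publish
        }
    }
    return temp;
}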

The cost of DCLP is that every use (in every thread) involves a load/acquire, which at a minimum requires an unoptimized load (i.e. an actual load from L1 cache). This is why static objects in C++11 (possibly implemented with DCLP) might be slower than in C++98 (which had no memory model).
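
As one of the comments under the question points out, if you just need a correct singleton (rather than a fence exercise), a function-local static lets the compiler generate the equivalent synchronization for you:

template<typename T>
T* Singleton<T>::get_instance()
{
    static T inst; // thread-safe initialization is guaranteed since C++11
    return &inst;
}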

For more information about DCLP, check this article from Jeff Preshing.

LWimsey
  • Thanks. I can clearly understand the need for the release fence, but fail to understand the need for the acquire fence. If the release fence guarantees that the value of `s_instance` becomes visible only AFTER the actual instance is initialized, then the moment the reading thread is able to read the value of `s_instance` is the moment it learns the location of the instance, and at that point the instance itself is already initialized. What can go wrong? – YoavKlein Mar 29 '22 at 15:43
  • @YoavKlein Acquire ordering is typically harder to comprehend, but it is actually the exact same thing; Memory ordering is a 2-way street. You don't want your writes reordered with the release, but neither do you want your reads reordered with the acquire (in opposite direction). Without acquire ordering, the reads could return values from before the first thread performed a release. (I'm using reordering of reads & writes as an example, but it applies to both). – LWimsey Mar 29 '22 at 16:21
  • I understand why we need the acquire barrier in the classic example discussed above, but there's an essential difference between the cases: in the classic example the loads can be reordered since the loading thread has the memory addresses of both `x` and `f`, whereas in the DCL case it doesn't; only at the point in time when the reading thread has `s_instance` assigned can it access the memory pointed to by it! Can you explain EXACTLY what could happen? – YoavKlein Mar 29 '22 at 17:31
  • I feel lost at "it guarantees that everything sequenced after the relaxed load and the fence (i.e. accessing the pointed-to memory) cannot be reordered before the load. This way, allocating and constructing the object in one thread and using it in another cannot overlap." Can you explain this paragraph better? – YoavKlein Mar 29 '22 at 19:23
  • @YoavKlein The classic example is about synchronizing `x` using `f` as a signalling variable. You're right that the loading thread has direct access to `x` (it knows its address), but how is that different from the DCLP case? The releasing thread allocates memory and hands it over to the acquiring thread, which then has access to the same memory returned by `new`. I can only repeat that (regarding acquire/release) there is no fundamental difference between the two scenarios. – LWimsey Mar 30 '22 at 13:46
  • The reading thread sets an acquire barrier, which prevents access to unsynchronized memory. Whether the memory is initialized yet (after `new`) is irrelevant; memory _must be_ synchronized, or accessing it is undefined behavior. This is literally required by the C++ standard. And explaining what 'exactly' can happen without acquire ordering is difficult. Nothing might happen at all, but per the standard, it is undefined behavior. – LWimsey Mar 30 '22 at 13:47
  • @YoavKlein About the overlap part... I was hoping the diagram I referred to in the answer might clarify this. The whole point of this acq/rel ordering is that threads access memory in a strictly ordered way. The standard calls it 'happens before'. If you follow the rules, everything the releasing thread does with this memory, happens before everything the acquiring thread does with it. – LWimsey Mar 30 '22 at 13:47
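
To make the failure mode in these comments concrete, this is the reader path without the acquire fence (a sketch; `use` is a hypothetical function that dereferences the object):

T* temp = s_instance.load(std::memory_order_relaxed); // no acquire
if (temp != nullptr)
    use(*temp); // data race: on a weakly-ordered CPU, the loads behind this
                // dereference may observe the object's memory from before the
                // constructor's writes became visible; per the C++ standard,
                // this is undefined behavior.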