3

I left out the rest of the implementation for simplicity because it is not relevant here. Consider the classical implementation of double-checked locking described in Modern C++ Design.

Singleton& Singleton::Instance()
{
    if(!pInstance_) 
    { 
         Guard myGuard(lock_); 
         if (!pInstance_) 
         {
            pInstance_ = new Singleton; 
         }
     }
     return *pInstance_;
}

Here the author insists that we avoid the race condition. But I have read an article, which unfortunately I don't remember very well, in which the following flow was described.

  1. Thread 1 enters the first if statement.
  2. Thread 1 acquires the mutex and gets into the body of the second if.
  3. Thread 1 calls operator new, assigns the memory to pInstance_, then calls the constructor on that memory.
  4. Suppose thread 1 has assigned the memory to pInstance_ but has not yet constructed the object, and thread 2 enters the function.
  5. Thread 2 sees that pInstance_ is not null (but not yet initialized by the constructor) and returns it.

In that article the author stated that the trick is that on the line pInstance_ = new Singleton; the memory can be allocated and assigned to pInstance_ before the constructor is called on that memory.
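Written out by hand, the decomposition the article described would look roughly like this (my own reconstruction, not actual compiler output; Widget and the function name are stand-ins):

```cpp
#include <new>  // operator new, placement new

struct Widget { int x = 42; };  // stand-in for the singleton type
Widget* pInstance_ = nullptr;

// The suspect interleaving, spelled out as three separate steps:
void publish_too_early() {
    void* raw = operator new(sizeof(Widget)); // 1. allocate raw memory
    pInstance_ = static_cast<Widget*>(raw);   // 2. publish the pointer early (!)
    new (pInstance_) Widget;                  // 3. run the constructor last
}
```

Another thread reading pInstance_ between steps 2 and 3 would see a non-null pointer to an unconstructed object.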

Referring to the standard or other reliable sources, can anyone please confirm or deny the possibility or correctness of this flow? Thanks!

Eduard Rostomyan
  • 7,050
  • 2
  • 37
  • 76
  • Sorry I don't quite understand your concern, once thread 1 has assigned `pInstance` it will be fine for any thread to return `pInstance` what do you mean by not created the object? `pInstance` will be initialised to `nullptr` – EdChum Jun 01 '18 at 10:49
  • Your step 4 is impossible. The assignment to `pInstance_` can not be done without the `new` have allocated memory and constructed the object first. – Some programmer dude Jun 01 '18 at 10:50
  • *" thread 1 assigned the memory to pInstance but not created the object and thread"* - Um.. what ? Are you concerned that `pInstance` has been assigned the memory pointer *before* the object constructor is actually called? If so, that isn't how `operator new` works – WhozCraig Jun 01 '18 at 10:51
  • That's my point: operator new works as follows, it allocates memory, then calls the constructor on it. So the article states that the memory can be allocated and assigned to pInstance, but the constructor not yet called, and then the context switch happens to thread 2. – Eduard Rostomyan Jun 01 '18 at 10:51
  • What context switching? You have a guard, `pInstance` is either `null` or it's been assigned to so it's safe to return – EdChum Jun 01 '18 at 10:52
  • 1
    You get the order wrong. The memory is allocated, the object is constructed, That's what `new` does. ***Then*** the assignment happens, separately from the `new`. The `new` and the assignment are two different operations sequenced in the correct order. – Some programmer dude Jun 01 '18 at 10:52
  • Whatever article you were reading was written by a fiend. The assignment isn't done until *after* (a) the memory has been allocated, and (b) construction is completed. Period. – WhozCraig Jun 01 '18 at 10:52
  • That was my question @Someprogrammerdude. I will find that article now and give you a link. The author states that that order is possible as well. – Eduard Rostomyan Jun 01 '18 at 10:53
  • 1
    In C++, you shouldn't have to worry about the construction order. The assignment will be sequenced after the constructor call. The real issue with the code example you posted is that we don't get to see how `pInstance_` is declared. It needs to be atomically `load`ed and `store`d to work properly on CPUs with a more relaxed memory model. Otherwise, thread #2 might see `pInstance_` be `nullptr` even after thread #1 releases the mutex. Alexandrescu mentions this toward the end of the section on the double-checked locking pattern. – Sean Cline Jun 01 '18 at 11:08
  • Despite its name, "Modern C++ Design" is no longer all that modern. It was a revelation to those of us just learning C++98, but rereading it lately I discovered that about half the material has since been superseded by language changes and additions to the standard library. C++17 is very different from C++98. And also, [Singletons are likely overused](https://stackoverflow.com/questions/86582/singleton-how-should-it-be-used) – Bo Persson Jun 01 '18 at 11:27
  • 1
    @EdChum The behavior of the code above is formally undefined if thread A calls `Singleton::Instance()` first, and then thread B subsequently calls `Singleton::Instance()` and attempts to use the returned pointer. In particular, on some architectures, it is possible for thread B to see the `Singleton` object in an un-initialized or a partially initialized state. See: https://en.cppreference.com/w/cpp/language/memory_model – Solomon Slow Jun 01 '18 at 13:31
  • @jameslarge thanks, wasn't aware of that – EdChum Jun 01 '18 at 14:23
  • @Someprogrammerdude - that's not at all how new generally works. Usually you get uninitialized storage (e.g., from `operator new (size_t)`), and then do construction and assign the pointer, but the compiler can reorder all of that and gcc actually does as my answer below shows. Such reordering is typical and common: the compiler doesn't need to emit assembly that follows the order in the source, it only needs to emit stuff that makes it _as if_ the source order was followed for the current thread. Other threads may see something different. So pretty much every comment here is wrong. – BeeOnRope Jun 10 '18 at 03:44

3 Answers

5

The problem is that in the absence of guarantees otherwise, the store of the pointer into pInstance_ might be seen by some other thread before the construction of the object is complete. In that case, the other thread won't enter the mutex and will just immediately return pInstance_ and when the caller uses it, it can see uninitialized values.

This apparent reordering between the stores associated with the construction of Singleton and the store to pInstance_ may be caused by the compiler or the hardware. I'll take a quick look at both cases below.

Compiler Reordering

Absent any specific guarantees related to concurrent reads (such as those offered by C++11's std::atomic objects) the compiler only needs to preserve the semantics of the code as seen by the current thread. That means, for example, that it may compile code "out of order" to how it appears in the source, as long as this doesn't have visible side-effects (as defined by the standard) on the current thread.

In particular, it would not be uncommon for the compiler to reorder stores performed in the constructor for Singleton, with the store to pInstance_, as long as it can see that the effect is the same1.

Let's take a look at a fleshed out version of your example:

struct Lock {};
struct Guard {
    Guard(Lock& l);
};

int value;

struct Singleton {
    int x;
    Singleton() : x{value} {}

    static Lock lock_;
    static Singleton* pInstance_;
    static Singleton& Instance();
};

Singleton& Singleton::Instance()
{
    if(!pInstance_) 
    { 
         Guard myGuard(lock_); 
         if (!pInstance_) 
         {
            pInstance_ = new Singleton; 
         }
     }
     return *pInstance_;
}

Here, the constructor for Singleton is very simple: it reads the global value and assigns it to x, the only member of Singleton.

Using godbolt, we can check exactly how gcc and clang compile this. The gcc version, annotated, is shown below:

Singleton::Instance():
        mov     rax, QWORD PTR Singleton::pInstance_[rip]
        test    rax, rax
        jz      .L9       ; if pInstance != NULL, go to L9
        ret
.L9:
        sub     rsp, 24
        mov     esi, OFFSET FLAT:_ZN9Singleton5lock_E
        lea     rdi, [rsp+15]
        call    Guard::Guard(Lock&) ; acquire the mutex
        mov     rax, QWORD PTR Singleton::pInstance_[rip]
        test    rax, rax
        jz      .L10     ; second check for null, if still null goto L10
.L1:
        add     rsp, 24
        ret
.L10:
        mov     edi, 4
        call    operator new(unsigned long) ; allocate memory (pointer in rax)
        mov     edx, DWORD value[rip]       ; load value global
        mov     QWORD pInstance_[rip], rax  ; store pInstance pointer!!
        mov     DWORD [rax], edx            ; store value into pInstance_->x
        jmp     .L1

The last few lines are critical, in particular the two stores:

        mov     QWORD pInstance_[rip], rax  ; store pInstance pointer!!
        mov     DWORD [rax], edx            ; store value into pInstance_->x

Effectively, the line pInstance_ = new Singleton; has been transformed into:

Singleton* stemp = operator new(sizeof(Singleton)); // (1) allocate uninitialized memory for a Singleton object on the heap
int vtemp     = value; // (2) read the global variable value
pInstance_    = stemp; // (3) write the pointer, still uninitialized, into the global pInstance_ (oops!)
pInstance_->x = vtemp; // (4) initialize the Singleton by writing x

Oops! Any second thread arriving when (3) has occurred, but (4) hasn't, will see a non-null pInstance_, but then read an uninitialized (garbage) value for pInstance_->x.

So even without invoking any weird hardware reordering at all, this pattern isn't safe without doing more work.

Hardware Reordering

Let's say you arrange for the reordering of the stores above not to occur on your compiler2, perhaps by inserting a compiler barrier such as asm volatile ("" ::: "memory"). With that small change, gcc now compiles this with the two critical stores in the "desired" order:

        mov     DWORD PTR [rax], edx
        mov     QWORD PTR Singleton::pInstance_[rip], rax

So we're good, right?

Well on x86, we are. It happens that x86 has a relatively strong memory model, and all stores already have release semantics. I won't describe the full semantics, but in the context of two stores as above, it implies that stores appear in program order to other CPUs: so any CPU that sees the second write above (to pInstance_) will necessarily see the prior write (to pInstance_->x).

We can illustrate that by using the C++11 std::atomic feature to explicitly ask for a release store for pInstance_ (this also enables us to get rid of the compiler barrier):

    static std::atomic<Singleton*> pInstance_;
    ...
       if (!pInstance_) 
       {
          pInstance_.store(new Singleton, std::memory_order_release); 
       }

We get reasonable assembly with no hardware memory barriers or anything (there is a redundant load now, but this is both a missed-optimization by gcc and a consequence of the way we wrote the function).

So we're done, right?

Nope - most other platforms don't have the strong store-to-store ordering that x86 does.

Let's take a look at ARM64 assembly around the creation of the new object:

    bl      operator new(unsigned long)
    mov     x1, x0                         ; x1 holds Singleton* temp
    adrp    x0, .LANCHOR0
    ldr     w0, [x0, #:lo12:.LANCHOR0]     ; load value
    str     w0, [x1]                       ; temp->x = value
    mov     x0, x1
    str     x1, [x19, #pInstance_]  ; pInstance_ = temp

So we have the str to pInstance_ as the last store, coming after the temp->x = value store, as we want. However, the ARM64 memory model doesn't guarantee that these stores appear in program order when observed by another CPU. So even though we've tamed the compiler, the hardware can still trip us up. You'll need a barrier to solve this.

Prior to C++11, there wasn't a portable solution for this problem. For a particular ISA, you could use inline assembly to emit the right barrier, your compiler might have a builtin like gcc's __sync_synchronize, or your OS might even offer something.
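For instance, with gcc's __sync_synchronize builtin, the publication could be fenced by hand (a non-portable sketch; the Widget type and function name are mine):

```cpp
struct Widget { int x; Widget() : x(7) {} };  // stand-in singleton type
Widget* pInstance_ = nullptr;

void publish_with_barrier() {
    Widget* tmp = new Widget;  // allocate and fully construct first
    __sync_synchronize();      // full barrier: construction can't be reordered
                               // past the publication below
    pInstance_ = tmp;          // publish the pointer last
}
```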

In C++11 and beyond, however, we finally have a formal memory model built into the language, and what we need there, for double-checked locking, is a release store as the final store to pInstance_. We saw this already for x86, where we checked that no barrier instruction was emitted; using std::atomic with memory_order_release, the object-publishing code on ARM64 becomes:

    bl      operator new(unsigned long)
    adrp    x1, .LANCHOR0
    ldr     w1, [x1, #:lo12:.LANCHOR0]
    str     w1, [x0]
    stlr    x0, [x20]

The main difference is the final store is now stlr - a release store. You can check out the PowerPC side too, where an lwsync barrier has popped up between the two stores.

So the bottom line is that:

  • Double checked locking is safe in a sequentially consistent system.
  • Real-world systems almost always deviate from sequential consistency, either due to the hardware, the compiler or both.
  • To solve that, you need to tell the compiler what you want; it will then both avoid reordering itself and emit the necessary barrier instructions, if any, to prevent the hardware from causing a problem.
  • Prior to C++11, the "way you tell the compiler" to do that was platform/compiler/OS specific, but in C++11 and later you can simply use std::atomic with memory_order_acquire loads and memory_order_release stores.

The Load

The above only covered half of the problem: the store of pInstance_. The other half that can go wrong is the load, and the load is actually the most important for performance, since it represents the usual fast path taken after the singleton is initialized. What if pInstance_->x were loaded before pInstance_ itself was loaded and checked for null? In that case, you could still read an uninitialized value!

This might seem unlikely, since pInstance_ needs to be loaded before it is dereferenced, right? That is, there seems to be a fundamental dependency between the operations that prevents reordering, unlike the store case. Well, as it turns out, both hardware behavior and software transformations could still trip you up here, and the details are even more complex than the store case. If you use memory_order_acquire, though, you'll be fine. If you want the last ounce of performance, especially on PowerPC, you'll need to dig into the details of memory_order_consume. A tale for another day.
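Putting both halves together, a complete C++11 version of the pattern might look like the sketch below (the mutex and class layout are my own; the acquire load on the fast path pairs with the release store that publishes the object):

```cpp
#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton& Instance() {
        // Fast path: acquire load pairs with the release store below,
        // so a non-null pointer implies a fully constructed object.
        Singleton* p = pInstance_.load(std::memory_order_acquire);
        if (!p) {
            std::lock_guard<std::mutex> guard(lock_);
            p = pInstance_.load(std::memory_order_relaxed); // re-check under the lock
            if (!p) {
                p = new Singleton;
                pInstance_.store(p, std::memory_order_release); // publish last
            }
        }
        return *p;
    }
    int x = 1;
private:
    Singleton() = default;
    static std::atomic<Singleton*> pInstance_;
    static std::mutex lock_;
};

std::atomic<Singleton*> Singleton::pInstance_{nullptr};
std::mutex Singleton::lock_;
```

The relaxed re-check inside the critical section is fine because the mutex itself orders it against the release store performed by whichever thread created the object.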


1 In particular, this means that the compiler has to be able to see the code for the constructor Singleton() so that it can determine that it doesn't read from pInstance_.

2 Of course, it's very dangerous to rely on this since you'd have to check the assembly after every compilation if anything changed!

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
4

The problem you describe can only occur if, for reasons I cannot imagine, the designers of the singleton used an explicit (and broken) two-step construction:

     ...
     Guard myGuard(lock_); 
     if (!pInstance_) 
     {
        auto alloc = std::allocator<Singleton>();
        pInstance_ = alloc.allocate(1); // SHAME here: race condition
        // eventually other stuff
        alloc.construct(pInstance_);    // anything could have happened since allocation
     }
     ....

Even if for some reason such a two-step construction were required, the pInstance_ member should never contain anything other than nullptr or a pointer to a fully constructed instance:

        auto alloc = std::allocator<Singleton>();
        Singleton *tmp = alloc.allocate(1); // no problem here
        // eventually other stuff
        alloc.construct(tmp);               // nor here
        pInstance_ = tmp;                   // a fully constructed instance

But beware: the fix is only guaranteed on a single-CPU system. Things could be much worse on multi-core systems, where C++11 atomic semantics are indeed required.
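Concretely, the multi-core case can be repaired by making pInstance_ atomic and publishing the fully constructed object with a release store, as the comments below suggest (a sketch under those assumptions; Widget is a stand-in type, and allocator_traits replaces the deprecated member construct):

```cpp
#include <atomic>
#include <memory>

struct Widget { int x = 3; };  // stand-in singleton type
std::atomic<Widget*> pInstance_{nullptr};

void publish() {
    auto alloc = std::allocator<Widget>();
    Widget* tmp = alloc.allocate(1);  // raw storage, visible only to this thread
    std::allocator_traits<std::allocator<Widget>>::construct(alloc, tmp);
    // Release store: the construction above becomes visible to any
    // thread that acquire-loads a non-null pointer.
    pInstance_.store(tmp, std::memory_order_release);
}
```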

Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • why can there be a race condition when the lock is aquired first? – 463035818_is_not_an_ai Jun 01 '18 at 11:04
  • 1
    @user463035818: because of the outer `if(!_pInstance)` that will bypass the lock once `_pInstance` is not null – Serge Ballesta Jun 01 '18 at 11:07
  • oh sure, so the race condition is of course not about two thread both creating the instance inside the `if`, sorry I was super confused for a second – 463035818_is_not_an_ai Jun 01 '18 at 11:08
  • 1
    The purpose of the DCL pattern is to handle the situation whereby _other_ threads observe object creation and pointer allocation out of order. Your first code snippet is what other thread may see and that is why DCL is necessary – LWimsey Jun 01 '18 at 11:12
  • Before C++11, the optimizer was allowed to reorder code to have this effect even when there was no explicit 2-stage init. The store could have happened before, after or _during_ construction. – Useless Jun 01 '18 at 11:54
  • 1
    The answer describes only *program order* (order of assembler instructions generated), but forgets about **CPU ordering**. But exactly the latter makes DCL **unsafe**, unless `pInstance_` has a proper type. The problem is, while one CPU assigns `pInstance_` fields before assigning the variable itself, the other CPU may see (because of cache) non-initialized fields with non-NULL `pInstance_`. As stated in the wiki, in C++11 `pInstance_` should be of **atomic type**, and use "acquire" load and "release" store memory ordering. – Tsyvarev Jun 01 '18 at 12:10
  • FWIW: Double-Checked locking became a wide-spread pattern back in the days when most computers had only one CPU. It _works_ on single-CPU systems. It was the wide-spread adoption of multi-CPU systems that broke it. – Solomon Slow Jun 01 '18 at 13:34
  • 1
    @Tsyvarev : your comment sheds light on an interesting point. I tried to summarize it in my edit. Please feel free to edit further if you find it useful – Serge Ballesta Jun 01 '18 at 13:49
  • Internally, the compiler can (and does!) compile `new` down to pretty much exactly the "broken conception" you illustrate above. The current thread can't tell the difference, so this is allowed, but another thread can see it and blow up, hence the "DCL is unsafe" conventional wisdom (but it can be made safe). – BeeOnRope Jun 10 '18 at 03:46
  • @BeeOnRope: I don't agree. Standart says in [expr.ass] *... In all cases, the assignment is sequenced after the value computation of the right and left operands*. So the compiler does first allocate memory and only then construct, but the assignment to `_pInstance` shall not occur before everything has been fully constructed. A conformant implementation should use a temporary pointer before the assignement (my second code) – Serge Ballesta Jun 11 '18 at 07:57
  • @SergeBallesta - yup, in the _abstract machine_ the assignment occurs after the computation of the right hand side, but that only helps you reason about the possible executions of the abstract machine on that thread. It doesn't constrain what the compiler actually _does_ when emitting assembly code and it doesn't constrain (in the absence of synchronization) what other threads see. In particular, compilers "violate" the sequencing _all the time_ when emitting machine code: in the most [trivial example I could come up with](https://godbolt.org/g/aTbbYC), both `clang` and `gcc` reorder reads... – BeeOnRope Jun 11 '18 at 19:39
  • ... despite the read of `x` being sequenced before `y` (they are in entirely different statements). The compiler is, as usual, using its wide latitude within the "as if" clause to compile the code however it wants, as long as the current thread can't tell the difference. Pretty much any non-trivial optimization is going to result in assembly that doesn't adhere to sequencing rules, but the current thread is none-the-wiser: other threads, however, can definitely observe this, as per the details in my answer. BTW, your `tmp` fix does nothing: the compiler can transform it back to the original. – BeeOnRope Jun 11 '18 at 19:41
1

It used to be unspecified before C++11, because there was no standard memory model discussing multiple threads.

IIRC the pointer could have been set to the allocated address before the constructor completed so long as that thread would never be able to tell the difference (this could probably only happen for a trivial/non-throwing constructor).

Since C++11, the sequenced-before rules disallow that reordering, specifically

8) The side effect (modification of the left argument) of the built-in assignment operator ... is sequenced after the value computation ... of both left and right arguments, ...

Since the right argument is a new-expression, that must have completed allocation & construction before the left-hand-side can be modified.
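Within a single thread, the quoted rule is even observable: a constructor that inspects the pointer will always see it still null, because the assignment's side effect comes last (a small illustration; the Probe class is mine):

```cpp
struct Probe;
Probe* pInstance_ = nullptr;

struct Probe {
    bool ptr_was_null;
    // The constructor runs as part of the new-expression, whose value
    // computation is sequenced before the assignment below, so
    // pInstance_ must still be null at this point.
    Probe() : ptr_was_null(pInstance_ == nullptr) {}
};

void create() { pInstance_ = new Probe; }
```

Note this only constrains what the constructing thread observes; other threads still need atomics, as the other answers explain.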

Useless
  • 64,155
  • 6
  • 88
  • 132