
The problems with the original Double-Checked Locking pattern have been well documented, for example in Meyers and Alexandrescu's "C++ and the Perils of Double-Checked Locking". I have seen the topic come up in questions on SO fairly often.

I have come up with a version that, to me, seems to solve the race condition problem of the original pattern, but does it look ok to you?

In the code below, I assume that LOCK is a properly implemented mutex type that causes a memory barrier at its lock/unlock, and I don't attempt to deal with the deallocation of the instance, as that's not part of the pattern.

I am aware of the other possible ways to do a singleton, but that's not what I am asking for. I am specifically asking about the pattern: is the race condition solved by having the allocation outside of the mutex lock?

#include <cstddef>  // for NULL

template <typename T, typename LOCK>
class Singleton
{
private:
    static T * instance_;
    static LOCK mutex;

    // private - inaccessible
    Singleton();
    Singleton(const Singleton &);
    Singleton & operator=(const Singleton &);

public:
    static T * get_instance()
    {
        if (instance_ == NULL)
        {
            // allocate outside the lock; publish under the lock
            T * temp = new T;
            mutex.lock();
            if (instance_ == NULL)
                instance_ = temp;
            else
                delete temp;    // another thread won the race
            mutex.unlock();
        }
        return instance_;
    }
};

// static member definitions
template <typename T, typename LOCK>
T * Singleton<T, LOCK>::instance_ = NULL;

template <typename T, typename LOCK>
LOCK Singleton<T, LOCK>::mutex;
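
For reference, here is how the template might be instantiated; this usage sketch assumes C++11's std::mutex as the LOCK parameter (it provides lock()/unlock() directly), and Widget is just a placeholder type:

#include <mutex>

struct Widget
{
    int value;
    Widget() : value(42) {}
};

int main()
{
    // The first caller allocates and publishes the instance; later calls reuse it.
    Widget * w = Singleton<Widget, std::mutex>::get_instance();
    return w->value == 42 ? 0 : 1;
}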
Lubo Antonov
  • What's wrong with just allocating the instance in a temporary, issuing a memory barrier, then stuffing it into the final shared variable (then unlocking)? – Lily Ballard Aug 26 '12 at 08:37
  • @KevinBallard, there is no portable way of doing a memory barrier, AFAIK – Lubo Antonov Aug 26 '12 at 08:38
  • Weird question. There *is* a portable way of doing a memory barrier in C++11. Before then, there wasn't, *but there was no portable way of creating a mutex either. Or a thread, for that matter* – jalf Aug 26 '12 at 09:20
  • @jalf, ok I see your point. So this could be rewritten with a memory barrier in C++11, but that's completely equivalent to this version. The question would be the same. – Lubo Antonov Aug 26 '12 at 09:23
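
As an aside, a C++11 rewrite along the lines just discussed could use explicit fences instead of the mutex's implied barriers on the fast path. The sketch below is illustrative only (the class name is made up, and note that it still needs an atomic pointer to avoid the data race discussed in the answers):

#include <atomic>
#include <mutex>

template <typename T>
class FenceSingleton    // hypothetical name, for illustration
{
    static std::atomic<T *> instance_;
    static std::mutex mutex_;

public:
    static T * get_instance()
    {
        T * tmp = instance_.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);    // pairs with the release fence below
        if (tmp == nullptr)
        {
            std::lock_guard<std::mutex> guard(mutex_);
            tmp = instance_.load(std::memory_order_relaxed);
            if (tmp == nullptr)
            {
                tmp = new T;
                std::atomic_thread_fence(std::memory_order_release);    // publish the construction
                instance_.store(tmp, std::memory_order_relaxed);
            }
        }
        return tmp;
    }
};

template <typename T>
std::atomic<T *> FenceSingleton<T>::instance_(nullptr);

template <typename T>
std::mutex FenceSingleton<T>::mutex_;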

2 Answers


No, this is not safe. There is a data race on instance_ (one thread could be reading from it in the if (instance_ == NULL) check while another is writing to it with instance_ = temp;), so this code has undefined behaviour.

You are asking if the single memory fence created by the lock is sufficient. From the perspective of C++11, it is not. From a non-C++11 perspective, I can't say for sure, but relying on non-atomic and non-mutex types for synchronization seems unlikely to work (in the pre-C++11 world, mutexes and atomic variables only work through compiler and processor specific hacks, and it seems foolish to rely on them to do anything more than their bare specification).

Mankarse
  • I am not sure I follow. Are you saying that writing to a pointer is somehow not atomic? If the writing thread gets preempted after it has temp in a register, but before it has written it out, then the reading thread will see NULL and continue and hit the mutex. – Lubo Antonov Aug 26 '12 at 08:50
  • @LuboAntonov: Unless you have explicitly made the pointer [atomic](http://en.cppreference.com/w/cpp/atomic/atomic), any concurrent access/modification will result in undefined behaviour. – Mankarse Aug 26 '12 at 08:55
  • Do you think that, at the system level, the write from a register to memory is not atomic? It seems to me that the reading thread will either see a 0 (in which case it will continue and stop on the mutex), or the correct pointer value for instance_. There is no intermediate possibility where the memory location has only half the address or something like that. – Lubo Antonov Aug 26 '12 at 09:05
  • As far as a singleton implementation goes, your version is fine, but it suffers the performance penalty of a cache synchronization every time you access instance_, essentially the same as _volatile_ (right?). Plus, this only works in C++11. As I mentioned, I am specifically interested in the DCL pattern and possible issues with this implementation of it. – Lubo Antonov Aug 26 '12 at 09:10
  • @LuboAntonov: `volatile` has nothing to do with cache synchronization or multi-threading. It is meant to indicate values which might be read from or written to by processes (in the general sense of the term) outside the view of the compiler. Also, the system level has absolutely nothing to do with the meaning of a C++ program (C++ is defined by the C++ standard). – Mankarse Aug 26 '12 at 09:22
  • Yes, of course you are right about volatile, bad analogy. Still, my point was about the effect of atomic. As for the rest, we are already talking about things that exist outside of the standard, such as when a thread can get preempted and what happens to the cache. I don't think you can ignore the hardware when you think about multithreading. So to get to the core of your objection - is there any conceivable scenario under which the final processor write of the address to the memory location can be partially completed when the thread gets preempted? – Lubo Antonov Aug 26 '12 at 09:34
  • To clarify: I know that instance_ = temp is not atomic. My point is that the logic of the reading thread will work regardless of whether the writing thread gets preempted here or not. – Lubo Antonov Aug 26 '12 at 09:46
  • @LuboAntonov: In what way are these things outside the standard? They are fully specified in `§1.10 Multi-threaded executions and data races`. – Mankarse Aug 26 '12 at 10:28
  • That said, I see your point now -- you are asking if the single memory fence created by the lock is sufficient. From the perspective of C++11, it is not. From a non-C++11 perspective, I can't say for sure, but relying on non-atomic and non-mutex types for synchronization seems unlikely to work (in the pre-C++11 world, mutexes and atomic variables only work through compiler and processor specific hacks, and it seems foolish to rely on them to do anything more than their bare specification). – Mankarse Aug 26 '12 at 10:28
  • I can agree with this, after reading the standard. Since it does not specify that a memory store of a pointer type has to be atomic, a compiler could potentially generate code to move the pointer value one byte at a time, which would of course create a race condition. I think there is zero chance of such a compiler ever existing, but the standard allows it. If you could add your last comment to your answer, I'd be glad to accept it. – Lubo Antonov Aug 26 '12 at 10:58
  • @LuboAntonov: The problem might not be in the exact store instruction generated by the compiler, but in the cache layer. Memory fences work all the way to the actual RAM. – MSalters Aug 27 '12 at 09:05
  • @MSalters, there is no problem with the cache - the mutex operations have an implied memory fence, or you can do an explicit memory fence (see the comments to the question). The problem is that the standard leaves the atomicity of the (low-level) memory store operation undefined. So on x86 the atomicity is effectively guaranteed, since no compiler would break down a 32-bit store into two 16-bit stores, let's say. But on an embedded architecture one could imagine a 32-bit address stored with two 16-bit machine instructions, resulting in a partial address if the thread gets preempted. – Lubo Antonov Aug 27 '12 at 09:31
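
Incidentally, C++11 makes that concern queryable: std::atomic reports whether the platform can perform a pointer store as a single indivisible operation or must fall back to an internal lock. A small illustrative sketch:

#include <atomic>
#include <iostream>

int main()
{
    std::atomic<int *> p(nullptr);
    // Prints 1 where a pointer store is a single indivisible instruction,
    // 0 where the library emulates atomicity with an internal lock.
    std::cout << p.is_lock_free() << '\n';
    return 0;
}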

The problem, as has been mentioned elsewhere, is that there is a data race on accesses to instance_: the first if statement reads the value, and the later assignment to instance_ writes to that variable. With no synchronization between these two operations, the behavior is undefined. But there's an easy solution if you have C++11. Change the declaration

static T * instance_;

to

static std::atomic<T *> instance_;

Now all the accesses of instance_ are atomic, which guarantees no tearing and provides synchronization.
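
Concretely, applying just that one change to the question's code might look like this (a sketch; the explicit load()/store() calls are optional, since std::atomic's conversions do the same thing with sequentially consistent ordering by default):

#include <atomic>

template <typename T, typename LOCK>
class Singleton
{
private:
    static std::atomic<T *> instance_;    // the only change from the original
    static LOCK mutex;

public:
    static T * get_instance()
    {
        if (instance_.load() == nullptr)    // atomic load: no tearing
        {
            T * temp = new T;
            mutex.lock();
            if (instance_.load() == nullptr)
                instance_.store(temp);      // atomic store: publishes the construction
            else
                delete temp;
            mutex.unlock();
        }
        return instance_.load();
    }
};

template <typename T, typename LOCK>
std::atomic<T *> Singleton<T, LOCK>::instance_(nullptr);

template <typename T, typename LOCK>
LOCK Singleton<T, LOCK>::mutex;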

Pete Becker