I suppose I might as well provide my answer...
First, "Method 1" works fine on C++0x. From the draft standard, section 6.7 (4) (emphasis mine):
> The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage duration (3.7.2) is performed before any other initialization takes place. ... Otherwise such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. ... *If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.*
So if you have C++0x, "Method 1" is a simple, correct, and 100% portable way to implement a thread-safe singleton. (Even prior to C++0x, g++ on Unix ensures this idiom will work. I do not know about MSVC.) This is also very likely to be the fastest solution, since the compiler can see exactly what you are doing and it knows more than you about your CPU's architecture.
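As a concrete illustration, here is one way "Method 1" typically looks. The `Whatsit` class and its `value()` payload are invented for this sketch; the thread-safety guarantee comes entirely from the function-local static:

```cpp
class Whatsit {
public:
    // Returns the single instance; per 6.7 (4), concurrent first calls
    // block until the initialization below has completed.
    static Whatsit& instance() {
        static Whatsit s_instance;  // "Method 1": function-local static
        return s_instance;
    }
    int value() const { return m_value; }
private:
    Whatsit() : m_value(42) {}          // 42 is just a placeholder payload
    Whatsit(const Whatsit&);            // non-copyable
    Whatsit& operator=(const Whatsit&); // non-assignable
    int m_value;
};
```

Note there is no explicit locking anywhere; the compiler emits whatever guard is needed, which is why it can also optimize it well.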
The easy way to fix "Method 2" is to put a mutex around the entire function, as Mihran suggests. But that might have unfortunate performance consequences, so people are always looking for ways to optimize it. Most of those ways introduce subtle bugs...
The "double-checked locking pattern" is the classic example. It looks like this:
```cpp
if (m_instance == NULL) {
    grab_mutex();
    if (m_instance == NULL)
        m_instance = new Whatsit();
    release_mutex();
}
return m_instance;
```
There are two problems with this pattern. First, individual memory accesses are not guaranteed to be atomic in general; simultaneous loads and stores of a single memory location by different threads can result in garbage being read. (I grant this is unlikely for a pointer value -- and certainly it will not happen on x86/x86_64 -- but do you really want your code to work on just one platform for one day?)
Second, both the compiler and the CPU are free to reorder memory accesses. So the thread that runs the `Whatsit` constructor and then fills in `m_instance` might actually perform those writes out of order, meaning another thread can test `m_instance`, observe it to be non-NULL, and then access the object before it has been initialized. This is not hypothetical; it really happens on modern CPUs. Worse, the more modern the CPU, the more likely it is to be a problem, because CPUs keep getting more and more aggressive about reordering memory accesses.
To fix this pattern, the first read of `m_instance` needs to have "acquire semantics", and the write to `m_instance` needs to have "release semantics". Definitions: If a memory load L has "acquire semantics", then later memory operations may not be reordered to happen before L. Similarly, if a store S has "release semantics", then earlier memory operations must not be reordered to happen after S.
Both of these are required in order for "double-checked locking" to be thread-safe. (Plus the individual loads and stores must be atomic.)
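Under C++0x you can express exactly these semantics portably with `std::atomic`. Here is a sketch of a corrected double-checked locking pattern; the `Singleton` wrapper and the trivial `Whatsit` payload are invented for illustration:

```cpp
#include <atomic>
#include <mutex>

class Whatsit {
public:
    Whatsit() : ready(1) {}
    int ready;
};

class Singleton {
public:
    static Whatsit* instance() {
        // The first read must be an acquire load, so that if we see a
        // non-null pointer, we also see the fully constructed object.
        Whatsit* p = s_instance.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(s_mutex);
            // Relaxed is fine here: the mutex already orders this read.
            p = s_instance.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Whatsit();
                // Publish with a release store: the constructor's writes
                // cannot be reordered after this store.
                s_instance.store(p, std::memory_order_release);
            }
        }
        return p;
    }
private:
    static std::atomic<Whatsit*> s_instance;
    static std::mutex s_mutex;
};

std::atomic<Whatsit*> Singleton::s_instance(nullptr);
std::mutex Singleton::s_mutex;
```

The acquire load on the fast path and the release store at publication are the two ingredients the naive version is missing.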
As Justin points out, declaring `m_instance` "volatile" will provide these semantics on MSVC. But this is not guaranteed by any standard, and in fact it is not true for many compilers (e.g., g++ on Unix). So if you are certain you will never care about any platform other than x86 Windows -- which you aren't -- then "volatile" will work.
What about using a compare-and-swap operation like InterlockedCompareExchangePointer?
```cpp
if (m_instance == NULL) {
    Whatsit *p = new Whatsit();
    if (InterlockedCompareExchangePointer(&m_instance, p, NULL) != NULL)
        delete p;  // another thread won the race; discard our copy
}
return m_instance;
```
The documentation says InterlockedCompareExchangePointer provides a "full memory barrier", which means it definitely has release semantics. So is this correct?
No, not in general. The problem is that the outer read of `m_instance` does not necessarily have "acquire semantics", so in principle another thread could still read the members of the object before they have been initialized. Modern CPUs do perform "speculative loads"; that is, they try to "guess" what you are going to need so they can load it from memory ahead of time. Granted, on a cache-coherent platform like x86, the CPU would have to be near-psychic to read an object before its pointer had even been computed... But if you use this pattern just because it happens to work on your system today, when it breaks on some future system and introduces nigh-impossible-to-debug failures, somebody will be cursing your name. So please do not do this.
Finally, you can use an API designed for one-time initialization, like the "One-Time Initialization API" on Windows or pthread_once on Unix. These work correctly by definition, but obviously they are platform-specific.
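For example, on Unix a `pthread_once` version might be sketched like this (the `Whatsit` payload and the `get_instance` name are hypothetical):

```cpp
#include <pthread.h>

class Whatsit {
public:
    Whatsit() : m_value(42) {}  // 42 is just a placeholder payload
    int value() const { return m_value; }
private:
    int m_value;
};

static Whatsit* m_instance = 0;
static pthread_once_t s_once = PTHREAD_ONCE_INIT;

// pthread_once guarantees this runs exactly once, even if multiple
// threads call get_instance() concurrently; late arrivals block until
// initialization completes.
static void create_instance() {
    m_instance = new Whatsit();
}

Whatsit* get_instance() {
    pthread_once(&s_once, create_instance);
    return m_instance;
}
```

This pushes all of the memory-ordering subtleties into the library, which is exactly where they belong.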
Bottom line: Use "Method 1" if your toolchain supports it; otherwise, fall back on a one-time-initialization API.