
The question is rather easy, in a way. Suppose I have this class:

static class Singleton {

}

And I want to provide a singleton factory for it. I can do the (probably) obvious. I am not going to mention the enum possibility or any other, as they are of no interest to me.

static final class SingletonFactory {

    private static volatile Singleton singleton;

    public static Singleton getSingleton() {
        if (singleton == null) { // volatile read
            synchronized (SingletonFactory.class) {
                if (singleton == null) { // volatile read
                    singleton = new Singleton(); // volatile write
                }
            }
        }
        return singleton; // volatile read
    }
}

I can get rid of one volatile read at the price of higher code complexity:

public static Singleton improvedGetSingleton() {
    Singleton local = singleton; // volatile read
    if (local == null) {
        synchronized (SingletonFactory.class) {
            local = singleton; // volatile read
            if (local == null) {
                local = new Singleton();
                singleton = local; // volatile write
            }
        }
    }

    return local; // non-volatile read
}

This is pretty much what our code has been using for close to a decade now.

The question is: can I make this even faster with the release/acquire semantics added in Java 9 via VarHandle?

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

static final class SingletonFactory {

    private static final SingletonFactory FACTORY = new SingletonFactory();

    private Singleton singleton;

    private static final VarHandle VAR_HANDLE;

    static {
        try {
            VAR_HANDLE = MethodHandles.lookup().findVarHandle(SingletonFactory.class, "singleton", Singleton.class);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private static Singleton getInnerSingleton() {

        Singleton localSingleton = (Singleton) VAR_HANDLE.getAcquire(FACTORY); // acquire

        if (localSingleton == null) {
            synchronized (SingletonFactory.class) {
                localSingleton = (Singleton) VAR_HANDLE.getAcquire(FACTORY); // acquire
                if (localSingleton == null) {
                    localSingleton = new Singleton();
                    VAR_HANDLE.setRelease(FACTORY, localSingleton); // release
                }
            }
        }

        return localSingleton;
    }
    
}

Would this be a valid and correct implementation?

Eugene
  • Is your question simply about whether this is correct, or are you actually intending to use this? If the latter, wouldn't a lazy holder just be easier? – Andy Turner Dec 06 '20 at 21:17
  • @AndyTurner But you might need to lazily initialize an instance variable. – spongebob Dec 06 '20 at 21:24
  • @AndyTurner at this point I am only doing this as an exercise. But as said in the question I am not interested in the lazy holder, for the reasons Francesco mentioned. This is simplified to make it easier to understand, I guess. – Eugene Dec 06 '20 at 21:33
  • There is one more possible solution. You can get rid of the synchronized block and replace it with a CAS, although this risks creating the singleton object multiple times (by multiple threads). – ciamej Dec 07 '20 at 02:19
  • Your example should even be fine with a store-store fence while setting the localSingleton and a load-load fence while loading the singleton. – pveentjer Jul 30 '21 at 14:46
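
For readers who want to see the CAS alternative ciamej mentions above spelled out, here is a minimal sketch (the class name is mine; AtomicReference is used for brevity, a VarHandle compareAndSet would work the same way). Note the trade-off: new Singleton() may run in several threads at once, but only one instance is ever published:

import java.util.concurrent.atomic.AtomicReference;

static final class CasSingletonFactory {

    private static final AtomicReference<Singleton> REF = new AtomicReference<>();

    public static Singleton getSingleton() {
        Singleton local = REF.get();
        if (local == null) {
            local = new Singleton();               // may be constructed more than once
            if (!REF.compareAndSet(null, local)) { // only one thread publishes
                local = REF.get();                 // the others adopt the published instance
            }
        }
        return local;
    }
}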

2 Answers


Yes, this is correct, and this variant is present on Wikipedia. (It wouldn't matter even if the field were declared volatile, since it is only ever accessed through the VarHandle.)

If the first read sees a stale value, the thread enters the synchronized block. Since synchronized blocks establish happens-before relationships, the second read will always see the written value. Wikipedia does note that sequential consistency is lost, but that refers to the fields; synchronized blocks are sequentially consistent, even though they are implemented with release/acquire semantics.

So the second null check will never succeed, and the object is never instantiated twice.

It is guaranteed that the second read will see the written value, because it is executed with the same lock held as when the value was computed and stored in the variable.

On x86 all loads have acquire semantics, so the only overhead is the null check. Release/acquire allows values to be seen only eventually (that's why the corresponding method was called lazySet before Java 9, and its Javadoc used that exact word). In this scenario that is prevented by the synchronized block.
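
For illustration, here is the same release store spelled with the pre-Java-9 API and with its Java 9 counterpart on an AtomicReference (the class and method names here are mine):

import java.util.concurrent.atomic.AtomicReference;

class LazySetDemo {

    static final AtomicReference<Singleton> REF = new AtomicReference<>();

    static void publish(Singleton s) {
        REF.lazySet(s);       // before Java 9: "Eventually sets to the given value."
        // REF.setRelease(s); // since Java 9: the equivalent release store
    }
}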

Instructions may not be reordered out of a synchronized block, although they may be reordered into it.

spongebob

I am going to try and answer this myself... TL;DR: this is a correct implementation, but it is potentially more expensive than the one with volatile.

Though this looks better, it can under-perform in some cases. I am going to test it against the famous IRIW example: independent reads of independent writes:

                        volatile x, y
     -----------------------------------------------------
     x = 1  |  y = 1   |     int r1 = x   |    int r3 = y
            |          |     int r2 = y   |    int r4 = x

This reads as :

  • there are two threads (ThreadA and ThreadB) that write to x and y, respectively (ThreadA does x = 1, ThreadB does y = 1)
  • there are two more threads (ThreadC and ThreadD) that read x and y, but in reverse order.

Because x and y are volatile, a result like the one below is impossible:

 r1 = 1 (x)      r3 = 1 (y)
 r2 = 0 (y)      r4 = 0 (x)

This is what the sequential consistency of volatile guarantees. If ThreadC observed the write to x (it saw that x = 1), then ThreadD MUST observe the same x = 1. This is because in a sequentially consistent execution writes happen as-if in a global order, or as-if atomically, everywhere, so every single thread must see the same value. This execution is therefore impossible, according to the JLS too:

If a program has no data races, then all executions of the program will appear to be sequentially consistent.
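
To make the IRIW shape concrete, it could be written in Java like this (a minimal sketch; the class name and thread structure are mine):

final class Iriw {

    static volatile int x, y;

    public static void main(String[] args) {
        new Thread(() -> x = 1).start(); // ThreadA
        new Thread(() -> y = 1).start(); // ThreadB
        new Thread(() -> {               // ThreadC
            int r1 = x;
            int r2 = y;
            System.out.println("r1=" + r1 + " r2=" + r2);
        }).start();
        new Thread(() -> {               // ThreadD
            int r3 = y;
            int r4 = x;
            System.out.println("r3=" + r3 + " r4=" + r4);
        }).start();
        // Because x and y are volatile, the combination
        // r1 = 1, r2 = 0, r3 = 1, r4 = 0 can never be printed.
    }
}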

Now, if we move the same example to release/acquire (x = 1 and y = 1 are releases, while the reads are acquires):

                       non-volatile x, y
     -----------------------------------------------------
     x = 1  |  y = 1   |     int r1 = x   |    int r3 = y
            |          |     int r2 = y   |    int r4 = x

A result like:

r1 = 1 (x)      r3 = 1 (y)
r2 = 0 (y)      r4 = 0 (x)

is possible and allowed. This breaks sequential consistency, and that is normal, since release/acquire is "weaker". On x86, release/acquire does not impose a StoreLoad barrier, so an acquire is allowed to go above (be reordered with) a release, unlike volatile, which prohibits this. In simpler words, volatile accesses themselves are not allowed to be reordered, while a chain like:

 release ... // (STORE)
 acquire ... // this acquire (LOAD) can float ABOVE the release

is allowed to be "inverted" (reordered), since StoreLoad is not mandatory.

Admittedly, this is somewhat wrong and beside the point, because the JLS does not explain things with barriers. Unfortunately, the release/acquire access modes are not yet documented in the JLS either...
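
The same shape with release/acquire instead of volatile could look like this (again a sketch, with names of my choosing); with these access modes the outcome above becomes allowed:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class IriwRelAcq {

    static int x, y; // plain fields, accessed only through the handles below

    static final VarHandle X, Y;

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            X = lookup.findStaticVarHandle(IriwRelAcq.class, "x", int.class);
            Y = lookup.findStaticVarHandle(IriwRelAcq.class, "y", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        new Thread(() -> X.setRelease(1)).start(); // ThreadA
        new Thread(() -> Y.setRelease(1)).start(); // ThreadB
        new Thread(() -> {                         // ThreadC
            int r1 = (int) X.getAcquire();
            int r2 = (int) Y.getAcquire();
            System.out.println("r1=" + r1 + " r2=" + r2);
        }).start();
        new Thread(() -> {                         // ThreadD
            int r3 = (int) Y.getAcquire();
            int r4 = (int) X.getAcquire();
            System.out.println("r3=" + r3 + " r4=" + r4);
        }).start();
        // Unlike the volatile version, r1 = 1, r2 = 0, r3 = 1, r4 = 0 is now allowed.
    }
}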


If I extrapolate this to the SingletonFactory example, it means that after a release:

 VAR_HANDLE.setRelease(FACTORY, localSingleton);

any other thread that does an acquire:

Singleton localSingleton = (Singleton) VAR_HANDLE.getAcquire(FACTORY);

is not guaranteed to read the value from the release (a non-null Singleton).

Think about it: in case of volatile, if one thread has seen the volatile write, every other thread will, for sure, see it too. There is no such guarantee with release/acquire.

As such, with release/acquire every thread might need to enter the synchronized block, and this might happen for many threads, because it is really unknown when the store made by the release will become visible to the acquiring load.

And even though synchronized itself does provide happens-before ordering, this code, at least for some time (until the release is observed), is going to perform worse, I assume: every thread competing to enter the synchronized block.

So in the end, this is about what is more expensive: a volatile store, or a release that is only eventually seen? I have no answer to this one.
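
If someone wants to measure instead of speculate, a JMH sketch along these lines would compare the steady-state read paths (everything here is hypothetical: it assumes JMH on the classpath, and assumes the two SingletonFactory variants from the question are renamed, e.g. VolatileFactory and AcquireReleaseFactory, with their getters made accessible):

import org.openjdk.jmh.annotations.Benchmark;

public class DclBenchmark {

    @Benchmark
    public Singleton volatileRead() {
        return VolatileFactory.getSingleton();            // volatile-based DCL
    }

    @Benchmark
    public Singleton acquireRead() {
        return AcquireReleaseFactory.getInnerSingleton(); // VarHandle acquire/release DCL
    }
}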

Eugene
  • I don't think there is any real performance penalty at least on x86. The volatile write makes sure that the store becomes visible by issuing an mfence instruction which is simply waiting for the store to complete. The same amount of time has to pass for the store to become visible with release semantics within a synchronized block, just the thread is free to continue executing and doesn't have to wait. – ciamej Dec 07 '20 at 02:17
  • @ciamej I was editing the answer while you were making this comment... – Eugene Dec 07 '20 at 02:24
  • "Think about it: in case of volatile every subsequent read will see the non-null value after the write, with release/acquire there is not such guarantee." -> This is only so, because the volatile write takes much more time than a release one; reads which are concurrent with the (long running) volatile write can observe a null value. – ciamej Dec 07 '20 at 02:30
  • @ciamej I am no expert, but afaik, before committing the store on `CPU1` an invalidate to all caches that have that address will be issued, at least on x86. So I don't know if you are correct, but as said, I am no expert. – Eugene Dec 07 '20 at 03:36
  • I'm not sure what you're asking. Yes, the caches will be invalidated, in both cases. In the volatile write case, the thread waits until it gets to know that the caches were invalidated. In the release write case, the thread continues execution, and the cache invalidation occurs "asynchronously". – ciamej Dec 07 '20 at 03:44
  • @ciamej you said _reads which are concurrent with the (long running) volatile write can observe a null value_ , in my understanding this means that _after_ you issued a volatile write, some thread can still observe the not-yet-written value. Did I understand you correctly? – Eugene Dec 07 '20 at 03:47
  • Yes, that's true. Thread A issues a store (let's say it is a volatile one, which means it is followed by an mfence). The cache invalidation message is sent through the cache coherence protocol. But the message does not arrive immediately. Nothing in this world happens immediately, even light propagation requires some time. During this time another thread running on a different core/cpu can read a cached stale value. – ciamej Dec 07 '20 at 03:53
  • @ciamej thank you. this is sort of crazy, at least for me. It means that the JLS part with : _A write to a volatile field happens-before every subsequent read of that same field_ has a very different meaning of what I expect – Eugene Dec 07 '20 at 03:57
  • It just means that once the volatile write finished each read that begins afterwards has to observe that written value. If the write and the read were concurrent (on a time diagram their intervals overlap) there is no such guarantee, but it is possible that the read will already observe the effects of the write. – ciamej Dec 07 '20 at 04:00
  • @ciamej now that I think about it, this also changes my view in sequential consistency vs release/acquire. If there are platforms (in theory at least) where caches could be updated independently, a release could be observed by only some threads, while others can still see that not-yet-written value. Again, I am thankful for your explanation here. – Eugene Dec 07 '20 at 04:04
  • I'm not an expert on ARM architecture, but I know they have more sophisticated cache coherency than x86. Perhaps what you describe is possible in ARM, but I don't know... Definitely there's more use of the load and store barriers and volatile read is not free anymore. When thinking about concurrency I always think about x86, but when programming in java you shouldn't really make assumptions and should stick with the formal memory model. – ciamej Dec 07 '20 at 04:11
  • So apart from ordering guarantees, the key difference between SC and release consistency is that the former is single-copy atomic and the latter is non-multi-copy atomic. So with SC there is a single point in time where the store becomes visible, and with RC there is no single point. So it could be that some cores see the write earlier and some see it a bit later. But both SC and RC do not provide any real-time guarantee of when the store becomes visible. Loads/stores can be skewed as long as PO isn't violated (or nobody can prove it was violated). – pveentjer Jul 30 '21 at 14:37
  • Therefore this part "is not guaranteed to read the value from the release (a non-null Singleton)." doesn't make a lot of sense, since SC also doesn't provide any real-time guarantees. Just as with SC, the store will eventually become visible, because loads/stores can't be skewed indefinitely. There is no need for a synchronized block. – pveentjer Jul 30 '21 at 14:39
  • In short: if the use case fits into the message passing idiom (like singleton) then RC will provide more opportunities for optimization compared to SC. – pveentjer Jul 30 '21 at 14:41
  • I do not know of any CPU architectures that have incoherent caches; the non-multi-copy store atomicity problems come either from sharing the store buffer with a subset of the CPUs or from the CPU not waiting for the ACKs of the cache-line invalidations. – pveentjer Jul 30 '21 at 14:44