Dekker's algorithm is a great example to explain sequential consistency.
Lets first give you a bit of background information. When a CPU does a store, the CPU won't wait for the store to be inserted in the (coherent) cache; once it is inserted into the cache it is globally visible to all other CPUs. The reason why a CPU doesn't wait is that it can be a long time before the cache line that contained the field you want to write is in the appropriate state; the cache line first needs to be invalidated on all other CPU's before the write can be performed on the cache-line and this can stall the CPU.
To resolve this problem modern CPUs make use of store buffers. So the store is written to the store buffer and at some point in the future, the store will be committed to the cache.
If the store is followed by a load to a different address, it could be that the store and the load are reordered. E.g. the cache line for the later load might already be in the right state while the cache line for the earlier store isn't. So the load is 'globally' performed before the store is 'globally' performed; so we have a reordering.
Now lets have a look at your example:
flag[tid] = true;
while (flag[other] == true) {
So what we have here is a store followed by a load to a different address and hence the store and the load can be reordered and the consequence is that the lock won't provide mutual exclusion since 2 threads at the same time could enter the critical section.
[PS]
It is likely that the 2 flags will be on the same cache line, but it isn't a guarantee.
This is typical what you see on the X86; an older store can be reordered with a newer load. The memory model of the X86 is called TSO. I will not go into the details because there is an optimization called store to load forwarding which complicates the TSO model. TSO is a relaxation of sequential consistency.
Dekker's and Peterson's mutex algorithms do not rely on atomic instructions like a compare and set; but they do rely on a sequential consistent system and one of the requirement of sequential consistency is that the loads and stores do not get reordered (or at least nobody should be able to proof they were reordered).
So what the compiler needs to do is to add the appropriate CPU memory fence to make sure that the older store and the newer load to a different address don't get reordered. And in this case that would be an expensive [StoreLoad] fence. It depends on the ISA, but on Intel X86 the [StoreLoad] will cause the load-store-unit stop executing loads till the store buffer has been drained.
You can't declare the fields of an array to be volatile. But there are a few options:
- don't use an array, but explicit fields.
- use an AtomicIntegerArray
- start messing around with fences using the VarHandle. Make sure the fields are accessed using a VarHandle with at least memory order opaque to prevent the compiler to optimize out any loads/stores.
Apart from what I said above, it isn't only the CPU that can cause problems. Because there is no appropriate happens before edge between writing and reading the flags, the JIT can also optimize this code.
For example:
if(!flag[other])
return;
while (true) {
if (turn == other) {
flag[tid] = false;
while (turn == other){}
flag[tid] = true;
}
}
And now it is clear that this lock could wait indefinitely (deadlock) and doesn't need to see change in the flag. More optimizations are possible, that could totally mess up your code.
if (turn == other) {
while (true) {}
}
If you add the appropriate happens before edges, the compiler will add:
- compiler fences
- CPU memory fences.
And Dekker's algorithm should work as expected.
I would also suggesting setting up a test using JCStress. https://github.com/openjdk/jcstress and have a look at this test:
https://github.com/openjdk/jcstress/blob/master/jcstress-samples/src/main/java/org/openjdk/jcstress/samples/concurrency/mutex/Mutex_02_DekkerAlgorithm.java