Let us start with some definitions. The volatile
keyword produces an acquire-fence on reads and a release-fence on writes. These are defined as follows.
- acquire-fence: A memory barrier in which other reads and writes are not allowed to move before the fence.
- release-fence: A memory barrier in which other reads and writes are not allowed to move after the fence.
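To make the definitions concrete, here is a minimal sketch (the field names are mine, not from the question) of what each kind of volatile access guarantees:

    class VolatileFences
    {
        volatile bool _flag;   // hypothetical field for illustration
        int _data;

        void Writer()
        {
            _data = 42;     // ordinary write
            _flag = true;   // volatile write = release-fence: the write to
                            // _data above cannot move below this store
        }

        void Reader()
        {
            bool f = _flag; // volatile read = acquire-fence: the read of
                            // _data below cannot move above this load
            int d = _data;  // sees 42 whenever f is true
        }
    }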
The method Thread.MemoryBarrier generates a full-fence; that is, it produces both an acquire-fence and a release-fence. Frustratingly, the MSDN documentation says this:
Synchronizes memory access as follows: The processor executing the
current thread cannot reorder instructions in such a way that memory
accesses prior to the call to MemoryBarrier execute after memory
accesses that follow the call to MemoryBarrier.
Interpreting this leads us to believe that it only generates a release-fence. So which is it, a full fence or a half fence? That is probably a topic for another question. I am going to work under the assumption that it is a full fence, because a lot of smart people have made that claim and, more convincingly, the BCL itself uses Thread.MemoryBarrier as if it produced a full-fence. So in this case the documentation is probably wrong. Even more amusingly, the statement actually implies that instructions before the call could somehow be sandwiched between the call and the instructions after it, which would be absurd. I say this in jest (but not really): it might benefit Microsoft to have a lawyer review all documentation regarding threading. I am sure their legalese skills could be put to good use in that area.
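Moving on, here is what the full-fence guarantee looks like in code (a contrived sketch with my own variable names):

    using System.Threading;

    class FullFenceDemo
    {
        static int a, b;

        static void Demo()
        {
            a = 1;                   // this store cannot move below the fence
            Thread.MemoryBarrier();  // full-fence: no read or write may cross
                                     // it in either direction
            int local = b;           // this load cannot move above the fence
        }
    }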
Now I am going to introduce an arrow notation to help illustrate the fences in action. An ↑ arrow will represent a release-fence and a ↓ arrow will represent an acquire-fence. Think of the arrow head as pushing memory access away in the direction of the arrow. But, and this is important, memory accesses can move past the tail. Read the definitions of the fences above and convince yourself that the arrows visually represent those definitions.
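For reference, the loop in question presumably looks something like this (a reconstruction based on the names stop and toggle; the original code may differ slightly):

    static bool stop;
    static bool toggle;

    static void Worker()
    {
        while (!stop)
        {
            Thread.MemoryBarrier(); // the full-fence analyzed below
            toggle = !toggle;
        }
    }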
Next we will analyze only the loop, as that is the most important part of the code. To do this I am going to unwind it. Here is what it looks like.
LOOP_TOP:
// Iteration 1
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
// Iteration 2
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
...
// Iteration N
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
LOOP_BOTTOM:
Notice that the call to Thread.MemoryBarrier constrains the movement of some of the memory accesses. For example, the read of toggle cannot move before the read of stop, or vice-versa, because those memory accesses are not allowed to move through an arrow head.
Now imagine what would happen if the full-fence were removed. The C# compiler, JIT compiler, or hardware now have a lot more liberty in moving the instructions around. In particular, the lifting optimization, known formally as loop invariant code motion, is now allowed. Basically, the compiler detects that stop is never modified inside the loop and so the read is bubbled up out of it. The value is now effectively cached in a register. If the memory barrier were in place then the read would have to push up through an arrow head, and the specification specifically disallows that. This is much easier to visualize if you unwind the loop like I did above. Remember, the call to Thread.MemoryBarrier would occur on every iteration of the loop, so you cannot draw conclusions about what would happen from only a single iteration.
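To make the hoist concrete, without the fence the loop may effectively be rewritten like this (an illustration of the transformation, not actual JIT output):

    bool registerCopy = stop;   // read hoisted out of the loop and cached
    while (!registerCopy)       // stop is never read again
    {
        toggle = !toggle;
    }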
The astute reader will notice that the compiler is free to swap the reads of toggle and stop in such a manner that stop gets "refreshed" at the end of the loop instead of the beginning, but that is irrelevant to the observable behavior of the loop. It has the exact same semantics and produces the same result.
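For illustration, one legal shape of that reordering might look like this in the same notation (my sketch, not actual compiler output):

    LOOP_TOP:
    ↑
    full-fence // via Thread.MemoryBarrier
    ↓
    read toggle into register
    negate register
    write register to toggle
    read stop into register      // stop is refreshed at the bottom instead
    jump-if-true to LOOP_BOTTOM
    goto LOOP_TOP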
My question is: why does adding a Thread.MemoryBarrier(), or even a Console.WriteLine(), in the while loop fix the issue?
Because the memory barrier places restrictions on the optimizations the compiler can perform; in particular, it disallows loop invariant code motion. The assumption is that Console.WriteLine produces a memory barrier, which is probably true. Without the memory barrier the C# compiler, JIT compiler, or hardware are free to hoist the read of stop up and out of the loop itself.
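Both fixes would look something like this in the loop (sketches; the Console.WriteLine barrier is an assumption, as noted):

    // Fix 1: an explicit full-fence keeps the read of stop inside the loop
    while (!stop)
    {
        Thread.MemoryBarrier();
        toggle = !toggle;
    }

    // Fix 2: assumed to work because Console.WriteLine probably
    // produces a memory barrier internally
    while (!stop)
    {
        Console.WriteLine(toggle);
        toggle = !toggle;
    }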
I am guessing that because, on a multiprocessor machine, the thread runs with its own cache of values, it never retrieves the updated value of stop because it has its value in cache?
In a nutshell...yes. Though keep in mind that it has nothing to do with the number of processors. This can be demonstrated with a single processor.
Or is it that the main thread does not commit this to memory?
No. The main thread will commit the write. The call to Thread.Join ensures that because it creates a memory barrier that prevents the write from moving below the join.
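In code, the main thread's side presumably looks something like the following sketch (variable names are mine):

    var worker = new Thread(Worker);
    worker.Start();
    Thread.Sleep(1000);  // give the loop time to spin
    stop = true;         // this write will be committed...
    worker.Join();       // ...because Join produces a barrier that prevents
                         // the write from moving below it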
Also why does Console.WriteLine() fix this? Is it because it also
implements a MemoryBarrier?
Yes. It probably produces a memory barrier. I have been keeping a list of memory barrier generators here.