Let us start with some definitions. The volatile
keyword produces an acquire-fence on reads and a release-fence on writes. These are defined as follows.
- acquire-fence: A memory barrier in which other reads and writes are not allowed to move before the fence.
- release-fence: A memory barrier in which other reads and writes are not allowed to move after the fence.
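To make the definitions concrete, here is a minimal sketch (the field names are mine, not from the question) of what each kind of volatile access guarantees:

    class VolatileFences
    {
        volatile bool _flag;   // hypothetical field for illustration
        int _data;

        void Writer()
        {
            _data = 42;     // ordinary write
            _flag = true;   // volatile write = release-fence: the write to
                            // _data above cannot move below this store
        }

        void Reader()
        {
            bool f = _flag; // volatile read = acquire-fence: the read of
                            // _data below cannot move above this load
            int d = _data;  // sees 42 whenever f is true
        }
    }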
The method Thread.MemoryBarrier generates a full-fence; that is, it produces both an acquire-fence and a release-fence. Frustratingly, the MSDN documentation says this:
Synchronizes memory access as follows: The processor executing the
current thread cannot reorder instructions in such a way that memory
accesses prior to the call to MemoryBarrier execute after memory
accesses that follow the call to MemoryBarrier.
Interpreting this leads us to believe that it only generates a release-fence. So which is it, a full fence or a half fence? That is probably a topic for another question. I am going to work under the assumption that it is a full fence, because a lot of smart people have made that claim and, more convincingly, the BCL itself uses Thread.MemoryBarrier as if it produced a full-fence. So in this case the documentation is probably wrong. Even more amusingly, the statement actually implies that instructions before the call could somehow be sandwiched between the call and the instructions after it, which would be absurd. I say this in jest (but not really): it might benefit Microsoft to have a lawyer review all documentation regarding threading. I am sure their legalese skills could be put to good use in that area.
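Moving on, here is what the full-fence guarantee looks like in code (a contrived sketch with my own variable names):

    using System.Threading;

    class FullFenceDemo
    {
        static int a, b;

        static void Demo()
        {
            a = 1;                   // this store cannot move below the fence
            Thread.MemoryBarrier();  // full-fence: no read or write may cross
                                     // it in either direction
            int local = b;           // this load cannot move above the fence
        }
    }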
Now I am going to introduce an arrow notation to help illustrate the fences in action. An ↑ arrow will represent a release-fence and a ↓ arrow will represent an acquire-fence. Think of the arrow head as pushing memory access away in the direction of the arrow. But, and this is important, memory accesses can move past the tail. Read the definitions of the fences above and convince yourself that the arrows visually represent those definitions.
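For reference, the loop in question presumably looks something like this (a reconstruction based on the names stop and toggle; the original code may differ slightly):

    static bool stop;
    static bool toggle;

    static void Worker()
    {
        while (!stop)
        {
            Thread.MemoryBarrier(); // the full-fence analyzed below
            toggle = !toggle;
        }
    }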
Next we will analyze only the loop, as that is the most important part of the code. To do this I am going to unwind it. Here is what it looks like.
LOOP_TOP:
// Iteration 1
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
// Iteration 2
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
...
// Iteration N
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
LOOP_BOTTOM:
Notice that the call to Thread.MemoryBarrier constrains the movement of some of the memory accesses. For example, the read of toggle cannot move before the read of stop, or vice-versa, because those memory accesses are not allowed to move through an arrow head.
Now imagine what would happen if the full-fence were removed. The C# compiler, JIT compiler, or hardware now have a lot more liberty in moving the instructions around. In particular, the lifting optimization, known formally as loop invariant code motion, is now allowed. Basically, the compiler detects that stop is never modified inside the loop and so the read is bubbled up out of it. The value is now effectively cached in a register. If the memory barrier were in place then the read would have to push up through an arrow head, and the specification specifically disallows that. This is much easier to visualize if you unwind the loop like I did above. Remember, the call to Thread.MemoryBarrier would occur on every iteration of the loop, so you cannot draw conclusions about what would happen from only a single iteration.
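To make the hoist concrete, without the fence the loop may effectively be rewritten like this (an illustration of the transformation, not actual JIT output):

    bool registerCopy = stop;   // read hoisted out of the loop and cached
    while (!registerCopy)       // stop is never read again
    {
        toggle = !toggle;
    }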
The astute reader will notice that the compiler is free to swap the reads of toggle and stop in such a manner that stop gets "refreshed" at the end of the loop instead of the beginning, but that is irrelevant to the observable behavior of the loop. It has the exact same semantics and produces the same result.
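For illustration, one legal shape of that reordering might look like this in the same notation (my sketch, not actual compiler output):

    LOOP_TOP:
    ↑
    full-fence // via Thread.MemoryBarrier
    ↓
    read toggle into register
    negate register
    write register to toggle
    read stop into register      // stop is refreshed at the bottom instead
    jump-if-true to LOOP_BOTTOM
    goto LOOP_TOP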
My question is: why does adding a Thread.MemoryBarrier(), or even a Console.WriteLine(), in the while loop fix the issue?
Because the memory barrier places restrictions on the optimizations the compiler can perform; in particular, it disallows loop invariant code motion. The assumption is that Console.WriteLine produces a memory barrier, which is probably true. Without the memory barrier the C# compiler, JIT compiler, or hardware are free to hoist the read of stop up and out of the loop itself.
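Both fixes would look something like this in the loop (sketches; the Console.WriteLine barrier is an assumption, as noted):

    // Fix 1: an explicit full-fence keeps the read of stop inside the loop
    while (!stop)
    {
        Thread.MemoryBarrier();
        toggle = !toggle;
    }

    // Fix 2: assumed to work because Console.WriteLine probably
    // produces a memory barrier internally
    while (!stop)
    {
        Console.WriteLine(toggle);
        toggle = !toggle;
    }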
I am guessing that because, on a multiprocessor machine, the thread runs with its own cache of values, it never retrieves the updated value of stop because it has its value in cache?
In a nutshell...yes. Though keep in mind that it has nothing to do with the number of processors. This can be demonstrated with a single processor.
Or is it that the main thread does not commit this to memory?
No. The main thread will commit the write. The call to Thread.Join ensures that because it creates a memory barrier that prevents the write from moving below the join.
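In code, the main thread's side presumably looks something like the following sketch (variable names are mine):

    var worker = new Thread(Worker);
    worker.Start();
    Thread.Sleep(1000);  // give the loop time to spin
    stop = true;         // this write will be committed...
    worker.Join();       // ...because Join produces a barrier that prevents
                         // the write from moving below it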
Also why does Console.WriteLine() fix this? Is it because it also
implements a MemoryBarrier?
Yes. It probably produces a memory barrier. I have been keeping a list of memory barrier generators here.