Why we need Thread.MemoryBarrier()?

Question

In "C# 4 in a Nutshell", the author shows that this class can write 0 sometimes without MemoryBarrier, though I can't reproduce in my Core2Duo:

public class Foo
{
    int _answer;
    bool _complete;
    public void A()
    {
        _answer = 123;
        //Thread.MemoryBarrier();    // Barrier 1
        _complete = true;
        //Thread.MemoryBarrier();    // Barrier 2
    }
    public void B()
    {
        //Thread.MemoryBarrier();    // Barrier 3
        if (_complete)
        {
            //Thread.MemoryBarrier();       // Barrier 4
            Console.WriteLine(_answer);
        }
    }
}

private static void ThreadInverteOrdemComandos()
{
    Foo obj = new Foo();

    Task.Factory.StartNew(obj.A);
    Task.Factory.StartNew(obj.B);

    Thread.Sleep(10);
}

This need seems crazy to me. How can I recognize all possible cases that this can occur? I think that if processor changes order of operations, it needs to guarantee that the behavior doesn't change.

Do you bother to use Barriers?

Related question: [How do I Understand Read Memory Barriers and Volatile](https://stackoverflow.com/questions/1787450/how-do-i-understand-read-memory-barriers-and-volatile) — jrh, Jun 08 '18 at 19:51
Another awesome read on the topic is [Joe Albahari's Threading in C# (part 4)](http://www.albahari.com/threading/part4.aspx) — Stas Ivanov, Jan 16 '22 at 16:50

score 77 · Accepted Answer · edited May 23 '17 at 11:54

You are going to have a very hard time reproducing this bug. In fact, I would go as far as saying you will never be able to reproduce it using the .NET Framework. The reason is because Microsoft's implementation uses a strong memory model for writes. That means writes are treated as if they were volatile. A volatile write has lock-release semantics which means that all prior writes must be committed before the current write.

However, the ECMA specification has a weaker memory model. So it is theoretically possible that Mono or even a future version of the .NET Framework might start exhibiting the buggy behavior.

So what I am saying is that it is very unlikely that removing barriers #1 and #2 will have any impact on the behavior of the program. That, of course, is not a guarantee, but an observation based on the current implementation of the CLR only.

Removing barriers #3 and #4 will definitely have an impact. This is actually pretty easy to reproduce. Well, not this example per se, but the following code is one of the more well known demonstrations. It has to be compiled using the Release build and ran outside of the debugger. The bug is that the program does not end. You can fix the bug by placing a call to Thread.MemoryBarrier inside the while loop or by marking stop as volatile.

class Program
{
    static bool stop = false;

    public static void Main(string[] args)
    {
        var t = new Thread(() =>
        {
            Console.WriteLine("thread begin");
            bool toggle = false;
            while (!stop)
            {
                toggle = !toggle;
            }
            Console.WriteLine("thread end");
        });
        t.Start();
        Thread.Sleep(1000);
        stop = true;
        Console.WriteLine("stop = true");
        Console.WriteLine("waiting...");
        t.Join();
    }
}

The reason why some threading bugs are hard to reproduce is because the same tactics you use to simulate thread interleaving can actually fix the bug. Thread.Sleep is the most notable example because it generates memory barriers. You can verify that by placing a call inside the while loop and observing that the bug goes away.

You can see my answer here for another analysis of the example from the book you cited.

@Brian Gideon May be it is very silly, but I don't understand why the program in your example never ends! Why `while (!stop)` never get `false`? Can you explain a bit more? Or can you please suggest some detailed blog? I read the article OP posted (albahari example), I got confused there too; to me, barrier 1 is enough, why there are other barriers? MSDN says, `MemoryBarrier` prevents instruction re-ordering. Then why only barrier 1 is not sufficient (because `completed` can't be executed before `answer` is set)? I really don't get it. — mshsayem, Feb 06 '16 at 17:08
@mshsayem There's a couple reasons. 1) In the thread's function, as far as the compiler is concerned, `while(!stop)` is going to be always true, the compiler will see that the thread function won't manipulate the variable and will ignore that check (unless you do `lock` or `Interlocked.Read` or similar). The compiler does not guarantee that the emitted IL will not optimize that out unless you explicitly make it so. — jrh, Jun 08 '18 at 19:20
@mshsayem 2) Some older CPUs (e.g., IA64, AKA "Itanium") did not guarantee cache coherency (on the hardware level) between threads, as in, if the code above happened to be running on two different cores, there is no guarantee that the CPU running the other thread would bother asking the other CPU what it set `stop` to and assume that its cached value (always false) is right. This is what the `lock` ASM prefix was introduced for, though as far as I know this is not as relevant for AMD64 CPUs which have stronger memory guarantees. — jrh, Jun 08 '18 at 19:21
A mental model that might be helpful is to think of it like this: Your threads don't all share the same pile of memory, they have their own local memory and only read or write shared memory that's accessible by other threads when you explicitly direct them to (by using `lock`, or some fancy combination of `Monitor`, `MemoryFence`, `Interlocked`, etc, note that the latter is definitely language lawyer territory). Even if thread A writes to `stop`, 1) it may not actually write that to RAM, and 2) even if it did, there's no guarantee that thread B will bother to read the value of `stop` from RAM — jrh, Jun 08 '18 at 19:42

score 11 · Answer 2 · answered Aug 24 '10 at 12:48

11

Odds are very good that the first task is completed by the time the 2nd task even starts running. You can only observe this behavior if both threads run that code simultaneously and there's no intervening cache-synchronizing operations. There is one in your code, the StartNew() method will take a lock inside the thread pool manager somewhere.

Getting two threads to run this code simultaneously is very hard. This code completes in a couple of nanoseconds. You would have to try billions of times and introduce variable delays to have any odds. Not much point to this of course, the real problem is when this happens randomly when you don't expect it.

Stay away from this, use the lock statement to write sane multi-threaded code.

answered Aug 24 '10 at 12:48

Hans Passant

922,412
146
1,693
2,536

The problem is that I dont need lock here. Method B should never write 0 since _complete just become true after set _answer. But is just a example. – Felipe Pessoto Aug 24 '10 at 12:56
1

@Fujiy: If you remove barrier 1 then the compiler/jitter/processor might re-order `A` so that `_complete` is set to `true` *before* `_answer` is set to `123`. ie, `B` could see that `_complete` is `true` and subsequently read `0` from answer. – LukeH Aug 24 '10 at 13:06
@Fujiy: Likewise if you remove barrier 4: the compiler/jitter/processor might re-order `B` so that `_answer` is read *before* `_complete`. ie, `B` could read `0` from answer while `_complete` is `false` and then subsequently read `true` from `_complete` *after* it has been updated. – LukeH Aug 24 '10 at 13:09
I understand. Maybe this is no easy way to make all consistent, just avoid shared memory. Most programmer can read this code and don´t see the race condition, because its a implementation detail of jit/processor. If I put a lock just at "A" method, B still could write 0? – Felipe Pessoto Aug 24 '10 at 13:35
I would use an AutoResetEvent in this specific case, replacing _complete. ARE includes the barrier. – Hans Passant Aug 24 '10 at 13:44
1

To clarify, the reason I'm not suggest an ARE or any other lock is that the code intentionally allows for a race condition. There's no guarantee that A completes before B. All that's required is that, if B does detect the completion of A, it also correctly gets the answer A generated. – Steven Sudit Aug 24 '10 at 13:56
Volatile is much too weak, it doesn't include a barrier. The MSDN docs for it only apply to Titanium. – Hans Passant Aug 24 '10 at 14:04
If I understand correctly, volatile would not prevent reordering of writes (which is why we need barrier 1), but it would prevent caching of stale results by B. – Steven Sudit Aug 24 '10 at 14:19
@Steven: I think that's correct. Except if you have a weaker memory model variant of the CLR then #1 does prevent the writes to `_answer` and `_complete` from being swapped. Do you agree? – Brian Gideon Aug 24 '10 at 14:38
@Brian: Yes, #1 is the only one that really, really matters (if the memory model would otherwise allow out-of-order writes). It matters because we absolutely can't allow `complete` to be set until `answer` is. Then again, if B can do out-of-order reads, it could have already read `answer` before checking `complete`. That's where `volatile` comes in, preventing the stale read of `answer`. – Steven Sudit Aug 24 '10 at 16:04
The Itanium was a disaster in more than one way. It will take *years* to get rid of these considerations. – Hans Passant Aug 24 '10 at 16:11
1

@StevenSudit Using `volatile` alone for thread synchronization is probably not a good idea, it only protects you from a few compiler/jit optimizations, it doesn't do much for the memory model. – jrh Jun 08 '18 at 19:33

score 2 · Answer 3 · answered Aug 24 '10 at 12:28

2

If you use volatile and lock, the memory barrier is built in. But, yes, you do need it otherwise. Having said that, I suspect that you need half as many as your example shows.

answered Aug 24 '10 at 12:28

Steven Sudit

19,391
1
51
53

Yes, the author says that Barriers 2 e 3 is just to guarantee that if A run before B, B enter the if – Felipe Pessoto Aug 24 '10 at 12:53
They're all theorectically needed if a lock-free strategy is desired. – Brian Gideon Aug 24 '10 at 13:48
1

@Brian: 1 and 4 are needed. I'm not so sure about 2 and 3. It looks as though the worst case is that it loses a race condition that it might or might not have otherwise won. – Steven Sudit Aug 24 '10 at 13:52
@Fujiy: There's absolutely no guarantee that A will complete before B. IF we wanted that, we would need A to acquire a lock and B to wait on it. Otherwise, it is a race condition. What matters for correctness is that, with 1 and 4 in place, there are no circumstances under which the wrong answer is printed. It's always possible that no answer is printed, as this is also correct. – Steven Sudit Aug 24 '10 at 13:55
@Steven: #2 is needed to commit (flush) the write to `_complete`. #3 is need to acquire (refresh) the read of `_complete`. #2 is needed purely on theorectical grounds. The current CLR implementation enforces volatile writes (and so does the x86 and x64 hardware) so removing it will most likely not cause any problems. But, if you contrive a slightly different example with a `while` loop it should be easy to replicate the bug if #3 is missing. – Brian Gideon Aug 24 '10 at 14:09
@Brian: Right, but if #2 doesn't run, then the worst case is that the write is delayed and the completeness test fails. This is the same result if #3 isn't run. It's not a bug, though. There's really no reason to expect A to complete before B. The only bug would be if B saw A as completed, but got the wrong answer. – Steven Sudit Aug 24 '10 at 14:17
1

@Steven: Yeah, I see your point now. It really boils down to your definition of a bug :) Along the same pandantic lines, #2 is not needed anyway because thread A terminates immediately which generates a barrier automatically. So for various different reasons (including the one you just mentioned) the author's example doesn't quite do justice to explaining all the caveats with lock-free synchronization. – Brian Gideon Aug 24 '10 at 14:30
@Brian: Well, even if A went and did some unrelated work for a few minutes, instead of just terminating, what's the worst case if we remove #2? Nothing: the predicate in B would simply be false. I'm starting to think that the *only* barrier that's strictly needed is #1. – Steven Sudit Aug 24 '10 at 14:32
1

@Steven: Absolutely! And using a more strict definition of a bug then I agree #1 is the most crucial (except that even it is not mandatory since the CLR already treats writes as volatile). – Brian Gideon Aug 24 '10 at 14:33
@Brian: Looks like the problem is real, but the example source code is not. :-) – Steven Sudit Aug 24 '10 at 14:39

score 2 · Answer 4 · answered Aug 24 '10 at 12:31

2

Its very difficult to reproduce multithreaded bugs - usually you have to run the test code many times (thousands) and have some automated check that will flag if the bug occurs. You might try to add a short Thread.Sleep(10) in between some of the lines, but again it not always guarantees that you will get the same issues as without it.

Memory Barriers were introduced for people who need to do really hardcore low-level performance optimisation of their multithreaded code. In most cases you will be better off when using other synchronisation primitives, i.e. volatile or lock.

answered Aug 24 '10 at 12:31

Grzenio

35,875
47
158
240

10

The problem with `Thread.Sleep` is that it can generate a memory barrier. So using that mechanism to try to reproduce threading bugs can actually fix the bug. – Brian Gideon Aug 24 '10 at 13:52
Combinatorial tests are a relatively good way to test [on a specific architecture/build]. Can easily vary: number of threads (and making sure they 'start' about the same time), work done in each thread (and ratio of access), chance of Thread.Yield (eg. percentage), etc.. the tests should be able to run millions of iterations in just a few seconds. A slightly more "devious" problem than having Thread.Sleep generate a memory barrier is thread-safe code attempting to get data back from threads causing memory barriers and other 'only test' artifacts.. self-defeating reporting D: – user2864740 Oct 17 '18 at 18:53

score 2 · Answer 5 · answered Sep 21 '17 at 05:19

I'll just quote one of the great articles on multi-threading:

Consider the following example:

class Foo
{
  int _answer;
  bool _complete;

  void A()
  {
    _answer = 123;
    _complete = true;
  }

  void B()
  {
    if (_complete) Console.WriteLine (_answer);
  }
}

If methods A and B ran concurrently on different threads, might it be possible for B to write “0”? The answer is yes — for the following reasons:

The compiler, CLR, or CPU may reorder your program's instructions to improve efficiency. The compiler, CLR, or CPU may introduce caching optimizations such that assignments to variables won't be visible to other threads right away. C# and the runtime are very careful to ensure that such optimizations don’t break ordinary single-threaded code — or multithreaded code that makes proper use of locks. Outside of these scenarios, you must explicitly defeat these optimizations by creating memory barriers (also called memory fences) to limit the effects of instruction reordering and read/write caching.

Full fences

The simplest kind of memory barrier is a full memory barrier (full fence) which prevents any kind of instruction reordering or caching around that fence. Calling Thread.MemoryBarrier generates a full fence; we can fix our example by applying four full fences as follows:

class Foo
{
  int _answer;
  bool _complete;

  void A()
  {
    _answer = 123;
    Thread.MemoryBarrier();    // Barrier 1
    _complete = true;
    Thread.MemoryBarrier();    // Barrier 2
  }

  void B()
  {
    Thread.MemoryBarrier();    // Barrier 3
    if (_complete)
    {
      Thread.MemoryBarrier();       // Barrier 4
      Console.WriteLine (_answer);
    }
  }
}

All the theory behind Thread.MemoryBarrier and why we need to use it in non-blocking scenarios to make the code safe and robust is described nicely here: http://www.albahari.com/threading/part4.aspx

score 1 · Answer 6 · answered Aug 24 '10 at 12:34

If you are ever touching data from two different threads, this can occur. This is one of the tricks that processors use to increase speed - you could build processors that didn't do this, but they would be much slower, so no one does that anymore. You should probably read something like Hennessey and Patterson to recognize all of the various types of race conditions.

I always use some sort of higher level tool like a monitor or a lock, but internally they are doing something similar or are implemented with barriers.

Why we need Thread.MemoryBarrier()?

6 Answers6

Linked

Related