16

I have spent the past several weeks doing multithreaded coding in C# 4.0. However, there is one question that remains unanswered for me.

I understand that the volatile keyword prevents the compiler from storing variables in registers, thus avoiding inadvertently reading stale values. Writes are always volatile in .NET, so any documentation stating that volatile also avoids stale writes is redundant.

I also know that the compiler optimization is somewhat "unpredictable". The following code illustrates a stall due to a compiler optimization (when running the Release build outside of Visual Studio):

using System.Threading;

class Test
{
    public struct Data
    {
        public int _loop;
    }

    public static Data data;

    public static void Main()
    {
        data._loop = 1;
        Test test1 = new Test();

        new Thread(() =>
        {
            data._loop = 0;
        }
        ).Start();

        do
        {
            if (data._loop != 1)
            {
                break;
            }

            //Thread.Yield();
        } while (true);

        // will never terminate
    }
}

The code behaves as expected. However, if I uncomment the //Thread.Yield(); line, the loop will exit.

Further, if I put a Sleep statement before the do loop, it will exit. I don't get it.

Naturally, decorating _loop with volatile will also cause the loop to exit (in its shown pattern).
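For completeness, here is what that looks like (a sketch; note that C# does allow volatile on an int field even inside a struct):

```csharp
public struct Data
{
    // volatile forces every read of _loop to go through memory,
    // so the JIT cannot cache it in a register across loop iterations
    public volatile int _loop;
}
```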

My question is: what are the rules the compiler follows in order to determine when to implicitly perform a volatile read? And why can I still get the loop to exit with what I consider to be odd measures?

EDIT

IL for code as shown (stalls):

L_0038: ldsflda valuetype ConsoleApplication1.Test/Data ConsoleApplication1.Test::data
L_003d: ldfld int32 ConsoleApplication1.Test/Data::_loop
L_0042: ldc.i4.1 
L_0043: beq.s L_0038
L_0045: ret 

IL with Yield() (does not stall):

L_0038: ldsflda valuetype ConsoleApplication1.Test/Data ConsoleApplication1.Test::data
L_003d: ldfld int32 ConsoleApplication1.Test/Data::_loop
L_0042: ldc.i4.1 
L_0043: beq.s L_0046
L_0045: ret 
L_0046: call bool [mscorlib]System.Threading.Thread::Yield()
L_004b: pop 
L_004c: br.s L_0038
IamIC
  • For the sleep, likely the sleep happens before _loop is read into the register and gives the other thread time to change _loop. Then again who knows, attach the debugger *after* running it (do not start with debugging) and look at the disassembly to be sure. – harold Dec 07 '11 at 11:23
  • For the yield (or any other non-inlined call), likely the JIT compiler realized that spilling the variable to the stack (if it was in a caller-save register) makes less sense than just reading it again. Then again who knows, see previous comment. – harold Dec 07 '11 at 11:26
  • @harold I agree that putting a sleep before the do loop simply allows the register to update in time. That one is easy. – IamIC Dec 07 '11 at 11:31
  • If you put a big calculation in that loop it also (usually?) exits. – harold Dec 07 '11 at 11:50
  • Could you post the actual disassembly? The MSIL code is fairly useless in this case. – harold Dec 07 '11 at 12:12
  • I have Reflector... how do I get the actual ASM? – IamIC Dec 07 '11 at 13:57
  • Run the program outside the debugger, then attach the debugger manually. To make that easier, I usually throw an exception from an `if` that can not be optimized away or I intentionally pass an argument that will cause an exception to be thrown. – harold Dec 07 '11 at 14:02
  • By the way, are you interested in the theoretical guarantees or in "what happens in practice"? There's a considerable difference in this case. – harold Dec 07 '11 at 15:32
  • I am interested in what happens in practice. The point is I have some code that requires volatile, and some that doesn't, and I'd like to know why. – IamIC Dec 07 '11 at 18:51

4 Answers

13

What are the rules the compiler follows in order to determine when to implicitly perform a volatile read?

First, it is not just the compiler that moves instructions around. The big 3 actors in play that cause instruction reordering are:

  • Compiler (like C# or VB.NET)
  • Runtime (like the CLR or Mono)
  • Hardware (like x86 or ARM)

The rules at the hardware level are a little more cut and dry in that they are usually documented pretty well. But, at the runtime and compiler levels there are memory model specifications that provide constraints on how instructions can get reordered, but it is left up to the implementers to decide how aggressively they want to optimize the code and how closely they want to toe the line with respect to the memory model constraints.

For example, the ECMA specification for the CLI provides fairly weak guarantees. But Microsoft decided to tighten those guarantees in the .NET Framework CLR. Other than a few blog posts I have not seen much formal documentation on the rules the CLR adheres to. Mono, of course, might use a different set of rules that may or may not bring it closer to the ECMA specification. And of course, there may be some liberty to change the rules in future releases as long as the formal ECMA specification is still honored.

With all of that said I have a few observations:

  • Compiling with the Release configuration is more likely to cause instruction reordering.
  • Simpler methods are more likely to have their instructions reordered.
  • Hoisting a read from inside a loop to outside of the loop is a typical type of reordering optimization.
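That hoisting can be sketched by hand like this (illustrative only, not actual JIT output):

```csharp
// What the source says:
while (data._loop == 1) { }

// What the JIT is allowed to emit, because _loop is not volatile:
int tmp = data._loop;     // single read, kept in a register
while (tmp == 1) { }      // the loop body never touches memory again
```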

And why can I still get the loop to exit with what I consider to be odd measures?

It is because those "odd measures" are doing one of two things:

  • generating an implicit memory barrier
  • circumventing the compiler's or runtime's ability to perform certain optimizations

For example, if the code inside a method gets too complex it may prevent the JIT compiler from performing certain optimizations that reorder instructions. You can think of it as sort of like how complex methods also do not get inlined.

Also, things like Thread.Yield and Thread.Sleep create implicit memory barriers. I have started a list of such mechanisms here. I bet if you put a Console.WriteLine call in your code it would also cause the loop to exit. I have also seen the "non terminating loop" example behave differently in different versions of the .NET Framework. For example, I bet if you ran that code in 1.0 it would terminate.
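As an aside, an Interlocked operation would also keep the loop honest, because interlocked instructions carry an implicit full fence (a sketch against the original sample):

```csharp
do
{
    // CompareExchange(ref x, 0, 0) is a read that carries a full fence
    // as a side effect, so the JIT cannot hoist it out of the loop
    if (Interlocked.CompareExchange(ref data._loop, 0, 0) != 1)
    {
        break;
    }
} while (true);
```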

This is why using Thread.Sleep to simulate thread interleaving could actually mask a memory barrier problem.

Update:

After reading through some of your comments I think you may be confused as to what Thread.MemoryBarrier is actually doing. What it does is create a full-fence barrier. What does that mean exactly? A full-fence barrier is the composition of two half-fences: an acquire-fence and a release-fence. I will define them now.

  • Acquire fence: A memory barrier in which other reads & writes are not allowed to move before the fence.
  • Release fence: A memory barrier in which other reads & writes are not allowed to move after the fence.

So when you see a call to Thread.MemoryBarrier it will prevent all reads & writes from being moved either above or below the barrier. It will also emit whatever CPU specific instructions are required.

If you look at the code for Thread.VolatileRead here is what you will see.

public static int VolatileRead(ref int address)
{
    int num = address;
    MemoryBarrier();
    return num;
}

Now you may be wondering why the MemoryBarrier call is after the actual read. Your intuition may tell you that to get a "fresh" read of address you would need the call to MemoryBarrier to occur before that read. But, alas, your intuition is wrong! The specification says a volatile read should produce an acquire-fence barrier. And per the definition I gave you above, that means the call to MemoryBarrier has to be after the read of address to prevent other reads and writes from being moved before it. You see, volatile reads are not strictly about getting a "fresh" read. It is about preventing the movement of instructions. This is incredibly confusing; I know.
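For symmetry, Thread.VolatileWrite is the mirror image: as I recall the decompiled code, the barrier sits before the write, which is exactly a release-fence (treat this as a sketch):

```csharp
public static void VolatileWrite(ref int address, int value)
{
    MemoryBarrier(); // release-fence: nothing above may move below the write
    address = value;
}
```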

Brian Gideon
  • @Brian your explanation of the memory fence is spot-on. But I would say that volatile is about getting a "fresh" read because it prevents the use of registers. – IamIC Dec 07 '11 at 16:48
  • @IanC: Hmm...not exactly. A fresh read would be a typical consequence of `volatile`, but it is not technically stated in the ECMA specification. Take a look at how `Thread.VolatileRead` is implemented and consider how things would play out if 2 subsequent calls to it were made with the same `address`. The first call might not be "fresh", but the second certainly would. The first read could be satisfied from the executing thread's write queue. At least the specification says this is possible. – Brian Gideon Dec 07 '11 at 20:15
  • @BrianGideon Then how come the [`VolatileRead`](http://msdn.microsoft.com/en-us/library/system.threading.thread.volatileread(v=vs.110).aspx) docs state that the method will read an updated value, regardless of what has been cached? "The value is the latest written by any processor in a computer, regardless of the number of processors or the state of processor cache." This implementation doesn't seem to make such guarantees.. – dcastro Feb 09 '14 at 15:10
  • @dcastro: The documentation is rather confusing on this point. Getting an updated value or "fresh read" isn't exactly what's going on. However, the end result is that you *usually* do get a "fresh read" because of how it is typically used. So in that regard the documentation is half correct. But it definitely does not match up with the documentation for `Thread.MemoryBarrier` and the specification for `volatile`. The specification has no mention whatsoever of "fresh reads" or "committed writes", so that is really the wrong way to look at things to begin with. – Brian Gideon Feb 09 '14 at 23:14
  • I assume that by "the specification" you mean the actual C# specification document, not the MSDN page, correct? I guess my suspicions were right then; the MSDN docs for volatile are kinda wrong.. I haven't yet read the specification, but I will. I have an ongoing discussion, would you mind weighing in? Thanks for the clarification. http://stackoverflow.com/q/21652938/857807 – dcastro Feb 09 '14 at 23:18
  • @dcastro: Yes. That's exactly what I mean. You are not the only person who has noted the partial contradiction between the ECMA specification and the MSDN documentation. – Brian Gideon Feb 09 '14 at 23:22
2

Your sample runs without terminating (most of the time, I think) because _loop can be cached in a register.

Any of the 'solutions' you mentioned (Sleep, Yield) will involve a memory barrier, forcing the compiler to refresh _loop.

The minimal solution (untested):

    do
    {
        System.Threading.Thread.MemoryBarrier();

        if (data._loop != 1)
        {
            break;
        }
    } while (true);
H H
  • I understand memory barriers. But even a Console.Beep() will do the trick. Surely that doesn't force a memory barrier? – IamIC Dec 07 '11 at 11:29
  • That makes sense, however it may not agree with the documentation of MemoryBarrier, which seems to imply it's just an mfence (or similar) and not special to the compiler. I may be reading it wrong though. – harold Dec 07 '11 at 11:30
  • Your solution works. I don't know why, though, since MemoryBarrier is designed to prevent reads from happening before writes have happened. There is no before/after code here. One can only deduce that the compiler is seeing the fence and "acting with caution". What I'd really love to know is what the compiler's rules are. – IamIC Dec 07 '11 at 11:34
  • The minimal solution is to mark _loop as volatile. But that isn't my question. – IamIC Dec 07 '11 at 11:37
  • @IanC: MemoryBarrier also flushes local cached values. – Tudor Dec 07 '11 at 11:40
  • @Tudor true... but would that update a register? Caches, yes. – IamIC Dec 07 '11 at 11:44
  • @IanC - you would have to look at the x86, I assume the Jitter understands it cannot optimize over a barrier. And this is only 'minimal' in the sense that it's local to the loop. – H H Dec 07 '11 at 11:46
2

It is not only a matter of the compiler; it can also be a matter of the CPU, which does its own optimizations. Granted, a consumer CPU generally does not have that much liberty, and usually the compiler is the one guilty of the above scenario.

A full fence is probably too heavy-weight for making a single volatile read.

Having said this, a good explanation of what optimization can occur is found here: http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/

Tudor
  • That is a good article (I read it earlier this week). It doesn't explain why the compiler is using an implicit volatile simply because I added Yield() or even Console.Beep(). – IamIC Dec 07 '11 at 12:05
0

There seems to be a lot of talk about memory barriers at the hardware level. Memory fences are irrelevant here. It's nice to tell the hardware not to do anything funny, but it wasn't planning to do so in the first place, because you are of course going to run this code on x86 or amd64. You don't need a fence here (and it is very rare that you do, though it can happen). All you need in this case is to reload the value from memory.
The problem here is that the JIT compiler is being funny, not the hardware.

In order to force the JIT to quit joking around, you need something that either (1) just plain happens to trick the JIT compiler into reloading that variable (but that's relying on implementation details) or that (2) generates a memory barrier or read-with-acquire of the kind that the JIT compiler understands (even if no fences end up in the instruction stream).

To address your actual question: there are well-defined rules only for what should happen in case (2).
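For case (2), a sketch using Thread.VolatileRead (already available in .NET 4.0; Volatile.Read came later) applied to your loop:

```csharp
do
{
    // VolatileRead is an acquire-read the JIT will not optimize across,
    // so _loop is reloaded from memory on every iteration
    if (Thread.VolatileRead(ref data._loop) != 1)
    {
        break;
    }
} while (true);
```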

harold