
Background

This was inspired by this question/answer and the ensuing discussion in the comments: Is the definition of “volatile” this volatile, or is GCC having some standard compliancy problems?. Based on others' and my interpretation of what should be happening, as discussed in the comments, I've submitted it to GCC Bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71793. Other relevant responses are still welcome.

Also, that thread has since given rise to this question: Does accessing a declared non-volatile object through a volatile reference/pointer confer volatile rules upon said accesses?

Intro

I know volatile isn't what most people think it is and is an implementation-defined nest of vipers. And I certainly don't want to use the below constructs in any real code. That said, I'm totally baffled by what's going on in these examples, so I'd really appreciate any elucidation.

My guess is this is due to either highly nuanced interpretation of the Standard or (more likely?) just corner-cases for the optimiser used. Either way, while more academic than practical, I hope this is deemed valuable to analyse, especially given how typically misunderstood volatile is. Some more data points - or perhaps more likely, points against it - must be good.

Input

Given this code:

#include <cstddef>

void f(void *const p, std::size_t n)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    volatile unsigned char const x = 42;
    // N.B. Yeah, const is weird, but it doesn't change anything

    while (n--) {
        *y++ = x;
    }
}

void g(void *const p, std::size_t n, volatile unsigned char const x)
{
    unsigned char *y = static_cast<unsigned char *>(p);

    while (n--) {
        *y++ = x;
    }
}

void h(void *const p, std::size_t n, volatile unsigned char const &x)
{
    unsigned char *y = static_cast<unsigned char *>(p);

    while (n--) {
        *y++ = x;
    }
}

int main(int, char **)
{
    int y[1000];
    f(&y, sizeof y);
    volatile unsigned char const x{99};
    g(&y, sizeof y, x);
    h(&y, sizeof y, x);
}

Output

g++ from gcc (Debian 4.9.2-10) 4.9.2 (Debian stable a.k.a. Jessie) with the command line g++ -std=c++14 -O3 -S test.cpp produces the below ASM for main(). Version Debian 5.4.0-6 (current unstable) produces equivalent code, but I just happened to run the older one first, so here it is:

main:
.LFB3:
    .cfi_startproc

# f()
    movb    $42, -1(%rsp)
    movl    $4000, %eax
    .p2align 4,,10
    .p2align 3
.L21:
    subq    $1, %rax
    movzbl  -1(%rsp), %edx
    jne .L21

# x = 99
    movb    $99, -2(%rsp)
    movzbl  -2(%rsp), %eax

# g()
    movl    $4000, %eax
    .p2align 4,,10
    .p2align 3
.L22:
    subq    $1, %rax
    jne .L22

# h()
    movl    $4000, %eax
    .p2align 4,,10
    .p2align 3
.L23:
    subq    $1, %rax
    movzbl  -2(%rsp), %edx
    jne .L23

# return 0;
    xorl    %eax, %eax
    ret
    .cfi_endproc

Analysis

All 3 functions are inlined, and both of the volatile local variables that do get allocated (f()'s x and main()'s x) end up on the stack, for fairly obvious reasons. But that's about the only thing the three functions share...

  • f() reads from x on each iteration, presumably due to its volatile qualification - but it just dumps the result into edx, presumably because the destination y isn't declared volatile and is never read, meaning the writes to it can be nixed under the as-if rule. OK, makes sense.

    • Well, I mean... kinda. Like, not really, because volatile is really for hardware registers, and clearly a local value can't be one of those - and can't otherwise be modified in a volatile way unless its address is passed out... which it's not. Look, there's just not a lot of sense to be had out of volatile local values. But C++ lets us declare them and tries to do something with them. And so, confused as always, we stumble onwards.
  • g(): What. By moving the volatile source into a pass-by-value parameter, which is still just another local variable, GCC somehow decides it's not volatile (or is somehow less volatile), and so it doesn't need to read it on every iteration... but it still carries out the loop, despite its body now doing nothing.

  • h(): By taking the passed volatile by reference, the same effective behaviour as f() is restored, so the loop performs a volatile read on each iteration.

    • This case alone actually makes practical sense to me, for the reasons outlined above against f(). To elaborate: imagine x refers to a hardware register, every read of which has side-effects. You wouldn't want to skip any of those (see the sketch just after this list).
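
To make that concrete, here's a minimal sketch - the address and register are entirely invented - of the kind of code where eliding the per-iteration reads would be actively wrong; the loop body has the same shape as h():

#include <cstddef>
#include <cstdint>

// Hypothetical example: the address is made up for illustration. Suppose
// this is a memory-mapped data register where every read pops the next
// byte from a hardware FIFO; it is accessed through a volatile reference,
// just like x in h().
volatile std::uint8_t &rx_fifo =
    *reinterpret_cast<volatile std::uint8_t *>(0x40001000u);

void drain(unsigned char *dst, std::size_t n)
{
    while (n--) {
        *dst++ = rx_fifo; // each iteration must perform a real read;
                          // coalescing or hoisting it would drop FIFO bytes
    }
}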

Adding #define volatile /**/ leads to main() being a no-op, as you'd expect. So, when present, volatile does do something even on a local variable... I just have no idea what, in the case of g(). What on Earth is going on there?
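
For reference, a minimal sketch of where that hack has to go for it to affect the whole translation unit:

// Diagnostic hack only: placed at the very top of test.cpp, before any
// #include, it erases the keyword from the whole translation unit.
// (Redefining a keyword isn't formally permitted, but as a quick
// "what changes without volatile?" experiment it does the job;
// passing -Dvolatile= on the command line is an equivalent trick.)
#define volatile /**/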

Questions

  • Why does a local value declared in-body produce different results from a by-value parameter, with only the latter's reads being optimised away? Both are declared volatile. Neither has its address passed out - and neither has a static address, ruling out any inline-ASM POKEry - so they can never be modified outwith the function. The compiler can see that each is constant, need never be re-read, and volatile just ain't true -
    • so (A) is either allowed to be elided under such constraints (acting as-if it weren't declared volatile)? -
    • and (B) why does only one get elided? Are some volatile local variables more volatile than others?
  • Setting aside that inconsistency for just a moment: After the read was optimised away, why does the compiler still generate the loop? It does nothing! Why doesn't the optimiser elide it as-if no loop was coded?

Is this a weird corner case due to the ordering of optimisation passes or some such? As the code is a daft thought-experiment, I wouldn't chastise GCC for this, but it'd be good to know for sure. (Or is g() the manual timing loop people have dreamt of all these years?) If we conclude there's no Standard bearing on any of this, I'll move it to GCC's Bugzilla just for their information.

And of course, the more important question from a practical perspective, though I don't want that to overshadow the potential for compiler geekery... Which, if any of these, are well-defined/correct according to the Standard?

  • TL;DR - If it doesn't change the observable behavior of the program does it really matter? – Captain Obvlious Jul 06 '16 at 23:00
  • The C++11 standard (I assume C++14 too) says "Access to volatile objects are evaluated strictly according to the rules of the abstract machine". In other words the "as-if" rule doesn't apply - you have to follow the abstract machine rules strictly. I think the behavior of `g()` violates this; I'd guess this is an optimizer bug. I'd also guess that most people would say it's a low priority bug, and that some (many?) will disagree that it's a bug in the first place. – Michael Burr Jul 06 '16 at 23:25
  • @MichaelBurr Nice quote, not only for its relevance but also for the fantastic term "abstract machine". I deal with a lot of those. I'm inclined to think it is a bug, in the sense of being an oversight, insofar as one would expect the 2 types of local variable to behave identically - as by my grok, the only discernible difference is the time at which they're constructed. However, due to the complexity of optimising and countless permutations of ordering different stages etc, it's obviously not that simple. I'd never dare to class it a `>ittybitty`-priority bug, but a conclusion would be great. – underscore_d Jul 06 '16 at 23:45
  • @MichaelBurr Also, while applicable to `volatile`-declared objects, 'as-if' is still AOK for others, right? It seems fair that as `y` isn't used, the compiler bins writes to it. Why it still reads from `x` is nebulous: With `x` a reference, that makes sense, in case it's a hardware register whose polling has side-effects. But that doesn't apply to the by-value cases...which nonetheless differ. Aside: whether objects can/should be 'made' volatile thru `volatile *` seems contested: top answer at linked Q says no, but N1381 seems to imply yes: http://open-std.org/jtc1/sc22/wg14/www/docs/n1381.pdf – underscore_d Jul 07 '16 at 00:32
  • @CaptainObvlious modifications to volatile variables (even automatic ones) are considered observable behaviour – M.M Jul 07 '16 at 00:57
  • I'd say `g` is a compiler bug according to the Standard – M.M Jul 07 '16 at 01:04
  • @M.M Is there a formal statement on whether reads are considered to have observable effects too? I envision that - depending on the device and how it's set up electronically vs how it interacts with the language - such reads might have side-effects on whichever register or device they're polling. You wouldn't want to skip those by eliding any reads. – underscore_d Jul 07 '16 at 01:04
  • @underscore_d yes, reads are also observable behaviour – M.M Jul 07 '16 at 01:04
  • @M.M Thanks for all the input! I'll wait a while longer, and assuming your analysis prevails (fwiw, it seems clearly correct to me), submit it to GCC's Bugzilla, unless one of them see it here first. – underscore_d Jul 07 '16 at 01:10
  • @M.M No, `g` is not a compiler bug. Say the assembly code issued individual reads but the CPU hardware coalesced them, would you still say it's a compiler bug? And how is the optimization any more or less observable if done by the CPU than if it's done by the compiler? The C standard does not say what the assembly code must look like (how could it?), it says what the assembly code must make the system do. Any optimization the CPU can do can also be done by the assembler. Coalescing reads to unshareable, cacheable values is a legal CPU optimization, so a legal compiler optimization. – David Schwartz Jul 07 '16 at 01:21
  • @DavidSchwartz but surely the system might be affected by the presence or absence of read operations, making elision thereof consequential for what the system does? and any compiler/CPU involved in code as esoteric as this would surely be very carefully selected to work in a very precise way (bearing in mind how implementation-defined `volatile` is) - so it wouldn't just issue spurious reads or haphazardly combine them. That seems well within the ballpark of `volatile` to me, at least for such a system. Besides - none of that addresses why `f()` and `g()` differ or which of the two is correct. – underscore_d Jul 07 '16 at 01:28
  • @DavidSchwartz The standard says that the system must perform a read of a memory location corresponding to `x`, once for each loop iteration. It would be non-conforming if the system (be it the compiler, or the CPU or whatever) combined all of those to a single read. – M.M Jul 07 '16 at 01:32
  • @DavidSchwartz since when are `volatile` referents "unshareable, cacheable values"? isn't the reality _precisely the opposite_? `volatile` innately implies shareable (obligatory: though not thread-safe) in the strict sense: that the program must re-read the physical memory location whenever asked to, in case something else has changed it. what could be more shareable and less cacheable than that? – underscore_d Jul 07 '16 at 01:40
  • In case anyone is at risk of being misled, here are quotes from both Standards requiring that the sequence of reads and writes to any given `volatile` variable be preserved, i.e. every operation must remain _and_ must occur in its written sequence. http://stackoverflow.com/questions/2535148/volatile-qualifier-and-compiler-reorderings – underscore_d Jul 07 '16 at 02:08
  • @M.M Then most modern compilers and systems are non-conforming because they do not prevent the CPU from coalescing reads of `volatile`s declared on the stack and unshared. – David Schwartz Jul 07 '16 at 02:58
  • @underscore_d Then no modern compiler on x86 CPUs is conforming, because they do not prevent the CPU from coalescing reads. (But that's an absurdly incorrect reading of the standard.) (In this particular case, the compiler happens to know that this particular variable is uncacheable and unshareable. Regardless of the fact that it's been declared `volatile`. The CPU can coalesce reads in this case, and the compiler cannot stop it. If you were reading the standard right, the majority of modern platforms would be violating the standard.) – David Schwartz Jul 07 '16 at 02:59
  • @DavidSchwartz if anything, you're making a case that this is a dumb requirement in the standard – M.M Jul 07 '16 at 03:07
  • @M.M No, it's not dumb at all. It's just very frequently misunderstood. Ask a separate question (about why it is not the case that every modern compiler on x86 is violating the standard) and I'll explain it in more detail. – David Schwartz Jul 07 '16 at 04:07
  • now at GCC Bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71793 – underscore_d Jul 07 '16 at 12:38
  • @M.M I'm fairly certain that the standard doesn't impose requirements on the behavior of the CPU, at least in this case. If my C/C++ code has 5 volatile reads of, e.g., a hardware register, the compiler must produce machine code that does exactly 5 reads of that address. The CPU is free to coalesce those reads, or do complicated hardware side effects (although the only example I know that actually happens is the register being cleared). – mbrig Jul 07 '16 at 16:41
  • @mbrig That does not make sense. If the compiler produced instructions that caused the CPU to coalesce the reads, then the compiler didn't produce instructions that did exactly those five reads. The standard isn't talking about what the instruction stream has to be, it's talking about what the system has to do. What specific codes the compiler emits to make the CPU do what the standard says it has to do is an implementation detail. – David Schwartz Jul 07 '16 at 20:53
  • @mbrig For example, say there was a CPU that always reversed the execution order of two consecutive writes to the same address. If code did `i=5; i=3;`, if the compiler was going to emit both writes, it would have to emit the instruction for the `i=3;` first, otherwise the final value would be wrong. Surely you wouldn't argue that if `i` was `volatile`, on such a platform, it's okay to make the final value 5 because the CPU reversed the instructions, not the compiler. It cannot be the order of the instructions that matters. – David Schwartz Jul 07 '16 at 20:56
  • @DavidSchwartz The compiler has to produce code that makes the machine behave as if it did the correct equivalent of the C code. If coalescing reads does not alter the behavior of the C code, the compiler does not need to prevent the CPU from doing it (not that a CPU would ever do such a thing if it had effects). Both CPU and compiler need to do the correct thing, volatile reads cannot be coalesced by the compiler because the CPU is allowed to (and does, on many platforms) create side effects. – mbrig Jul 07 '16 at 21:21
  • @mbrig Not altering the behavior of the C code is not what the standard requires. Accesses to `volatile`s are declared to be observable behavior and modifying the observable behavior is not permitted under the as-if rule, or any other rule. I can't make sense of your last sentence. Are you saying that the compiler cannot coalesce reads of `volatile`s even in cases where the CPU could coalesce them? If so, on what basis do you reach that conclusion? Why does it matter where the coalescing is done? – David Schwartz Jul 07 '16 at 21:29
  • @mbrig The standard doesn't say anything at all about what instructions the compiler has to emit. That there is machine code is an implementation detail. The standard says what the machine has to do. All that matters is what those instructions make the CPU do. If read coalescing is legal, it can be done by the compiler, the CPU, or whatever. If read coalescing is illegal, it cannot be done by the system, regardless of how it's made to happen. That the instruction stream is where volatile accesses are to be observed is made up and found nowhere in the standard. – David Schwartz Jul 07 '16 at 21:30
  • @DavidSchwartz: If the compiler either knew that there was no way the CPU could be configured that would not cause multiple reads to be equivalent to a single read, or else it expressly documented that it did not support CPU modes that would distinguish multiple reads from a single read, then the compiler could coalesce the reads itself provided that each such read continued to be regarded as a "side-effect" for purposes of preventing infinite-loop-related UB. Mere ignorance of any means by which single or multiple reads might be distinguishable, however, would not be sufficient to... – supercat Jul 08 '16 at 00:05
  • ...justify coalescing them at the compiler level unless the compiler expressly documented that such features, if they exist, will not be supported. – supercat Jul 08 '16 at 00:05
  • @supercat Are you saying it should do this because it has any practical use? Or are you saying that while this would severely impact performance negatively, it nevertheless is what the standard requires? – David Schwartz Jul 08 '16 at 00:30
  • @DavidSchwartz: If the compiler writer wants to document that certain CPU features are not supported, that shouldn't impact performance at all. Otherwise, the compiler writer should assume that a programmer who wants `volatile` knows something that the compiler doesn't. There are many places where it would be helpful to have "maybe" qualifiers on parameters to expressly allow the compiler to omit qualifiers when in-lining a function in cases where the arguments don't have them, or have return types qualified to match parameters (while the prototype of `strchr` couldn't be changed... – supercat Jul 08 '16 at 00:53
  • ...to use such a feature, it should logically be declared in such a fashion as to say that the "const"-ness of the returned pointer should match the const-ness of the one that's passed in). In the absence of such a thing, however, I would suggest that programmers are more interested in correctness than speed, and a compiler should trust that a programmer who uses `volatile` does so for a reason, and that a programmer who is interested in speed, writing two versions of a functions--one which accepts a `volatile`-qualified pointer and one which accepts a non-qualified pointer... – supercat Jul 08 '16 at 00:55
  • ...would be better than having a compiler try to guess whether the programmer doesn't really need something to be volatile-qualified in a particular case. Even in a non-exotic platform, a programmer might enable a signal handle during a small stretch of code and require that all accesses to a particular variable be treated as `volatile` while the handler is active. Provided that the last access to the variable before the handler is enabled is a volatile write, the first access after is either a write or a volatile read, and there are no non-qualified accesses while it is enabled... – supercat Jul 08 '16 at 00:58
  • ...I would suggest that a programmer should be able to achieve defined behavior without having to use wasteful `volatile` accesses at other times when the handler is not enabled. – supercat Jul 08 '16 at 00:59
  • @DavidSchwartz "Are you saying that the compiler cannot coalesce reads of volatiles even in cases where the CPU could coalesce them?" Yes, exactly. The compiler is prohibited in most/many cases from coalescing volatile reads. The CPU has more information available to it, and can safely coalesce reads in certain cases. Eg: `volatile int* x = 0x10; *x; *x; int y =*x;` The compiler does not know if reading address 0x10 has a side effects, so it must issue all reads, in-order. The CPU is aware if reading 0x10 has side effects, or perhaps it is a read-only register, and can coalesce or not, safely. – mbrig Jul 11 '16 at 14:43
  • @mbrig You cannot possibly make portable statements about what the compiler does and does not know. That's utterly absurd. Are you saying a CPU that permits interrogation of which memory ranges have side-effects cannot exist? – David Schwartz Jul 11 '16 at 17:34
  • @DavidSchwartz I don't think that's relevant. The C and C++ standards state that reads to volatile memory are observable behavior ( "(1.9/6): The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and calls to library I/O functions. "), so the compiler needs to preserve them. – mbrig Jul 11 '16 at 17:52
  • @mbrig Then so does the CPU. The standards don't distinguish between the CPU and the compiler. (How could they? They just say what has to happen, not what does it or how.) But Intel CPUs aggregate reads. So either everyone is violating the standard or you are reading it wrong. – David Schwartz Jul 11 '16 at 17:56
  • @DavidSchwartz They obviously don't aggregate reads when it has side effects, or when there's the possibility of interrupts modifying the memory outside of normal flow (or if they do, they document specific memory barriers that the compiler must issue to protect the reads). The compiler and CPU obviously preserve the order and number of volatile reads/writes (where it matters), or interrupt service routines wouldn't work, and the CPU/compiler combo would be unusable for many cases. – mbrig Jul 11 '16 at 18:53
  • @mbrig Again, that can't be right. Preserving things "where it matters" is the as-if rule. And `volatile` is an explicit exception to the as-if rule since accesses to them are *defined* as observable behavior. As you are reading the spec, it says you can't touch them ever where it doesn't matter. (Again, I don't agree with the way you're reading the spec. But you are reading it as accesses to volatile variables are defined as observable behavior and so can't be changed even where it doesn't matter. If not, then please explain what you think the spec is saying.) – David Schwartz Jul 11 '16 at 21:13
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/117026/discussion-between-david-schwartz-and-mbrig). – David Schwartz Jul 11 '16 at 22:15

1 Answer


For f: GCC eliminates the non-volatile stores (but not the loads, which can have side-effects if the source location is a memory mapped hardware register). There is really nothing surprising here.

For g: Because of the x86_64 ABI, the parameter x of g is allocated in a register (i.e. rdx) and does not have a location in memory. Reading a general-purpose register does not have any observable side effects, so the dead read gets eliminated.
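
As a sketch of the "copy the value to memory and read from that location" idea discussed in the comments below (the name g2 and the explicit copy are mine, purely for illustration; whether the compiler should do something like this for you is exactly what's in dispute):

#include <cstddef>

// Sketch only: the register-passed parameter is copied into a
// stack-allocated volatile local, i.e. the same situation as f()'s x,
// for which GCC did keep the per-iteration reads.
void g2(void *const p, std::size_t n, volatile unsigned char const x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    volatile unsigned char const copy = x; // single read of the register-passed x

    while (n--) {
        *y++ = copy; // reads of copy hit its memory location each iteration
    }
}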

avdgrinten
  • That sounds similar to what Richard Biener replied on my ticket - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71793 - but his reply and edits of the ticket, chiefly the tag `wrong-code`, indicate that he doesn't think this is OK. Do you? The allocation of `g(x)` in a register seems like a detail of the ABI - in which case, thanks for the mechanistic explanation - but not permission to break `volatile`. It looks like the compiler should alter its behaviour to behave properly in this case. – underscore_d Jul 07 '16 at 15:07
  • Well, what behavior do you expect? The compiler cannot issue a memory-read because there is no memory location to read from. The read operation of the abstract machine really IS a no-op here. It could copy the value to memory and read from that location but that would only make sense if `x` could actually escape `g`. – avdgrinten Jul 07 '16 at 15:15
  • Again setting aside the dubious utility of a `volatile` local variable, what I expect is the same as for any other `volatile`-declared object: that reads and writes must occur in memory and cannot be elided nor reordered (relative to others on the same object). So what I would expect is that `x` would _not_ be allocated in a register, rather on the stack, as in the other 2 cases. Can the compiler do that without breaking ABI? If so, I think that's what it should do. – underscore_d Jul 07 '16 at 15:17
  • The compiler cannot allocate `x` in memory because the ABI does not special case volatile arguments (because they do not make much sense) and just passes them in registers just like non-volatile arguments. Allocating `x` in memory would break the ABI's function calling sequence. Note that the compiler would be allowed to copy `x` to a different location (but why should reads/write to this location remain ordered/unaltered?) but it certainly cannot accept the `x` argument in memory without breaking the ABI. – avdgrinten Jul 07 '16 at 15:23
  • Thanks for the explanation. Does the same apply to 32-bit x86? I don't have a 32-bit-only machine at hand, but cross-compiling with MSYS2's MinGW32 produces identical ASM... though now I'm wondering whether my Debian `stable` box that produced the ASM above is 32-bit! I'll wait to see what the folks at GCC conclude, but I think you might well be right. If so, this probably just arose from a combination of ABI compliance and lack of special-case handling for this odd scenario - like copying to memory & using that as you said. If so, we'll see whether they add an (academic) workaround for this. – underscore_d Jul 07 '16 at 15:38
  • @avdgrinten: Why can't a compiler allocate x in memory? The caller isn't going to put the value into memory, but all that means is that the function prologue code will have to do so. If `x` is volatile and the function contains a setjmp, I would think a compiler would likely have to keep it in memory and treat it as `volatile` whether or not its address is taken unless the compiler knows that no `setjmp` will occur unexpectedly. – supercat Jul 08 '16 at 01:02
  • I agree that the compiler should certainly copy the value into memory (and not elide accesses to this memory location) if the function contains a `setjmp` or lets a pointer to the local volatile variable escape the function. In other cases my reading of the standard is that it is okay to elide volatile read/writes, even if the variable was not stored in a register: The compiler behaves `as-if` the volatile read/write actually took place, because it can prove that no one (and not even memory mapped hardware, signal handlers or other async events) can actually observe the access. – avdgrinten Jul 08 '16 at 10:37