
Suppose I have a number of statements that I want to execute in a fixed order. I want to use g++ with optimization level 2, so some statements could be reordered. What tools does one have to enforce a certain ordering of statements?

Consider the following example.

using Clock = std::chrono::high_resolution_clock;

auto t1 = Clock::now(); // Statement 1
foo();                  // Statement 2
auto t2 = Clock::now(); // Statement 3

auto elapsedTime = t2 - t1;

In this example it is important that Statements 1–3 are executed in the given order. However, can't the compiler think Statement 2 is independent of Statements 1 and 3 and execute the code as follows?

using Clock=std::chrono::high_resolution_clock;

foo();                  // Statement 2
auto t1 = Clock::now(); // Statement 1
auto t2 = Clock::now(); // Statement 3

auto elapsedTime = t2 - t1;
Peter Mortensen
S2108887
    If the compiler thinks they're independent when they're not, the compiler is broken and you should use a better compiler. – David Schwartz Jun 13 '16 at 11:43
    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0342r0.html – Howard Hinnant Jun 13 '16 at 14:29
    could `__sync_synchronize()` be of any help? – vsz Jun 13 '16 at 16:30
  • Try putting foo in a different compilation unit to the function above. This may stop the compiler from being able to analyse it and thus force it to keep the same ordering. – Martin York Jun 13 '16 at 18:35
    @HowardHinnant: The semantic power of standard C would be improved tremendously if such a directive were defined, and if the aliasing rules were adjusted to exempt reads performed after a barrier of data which was written before it. – supercat Jun 13 '16 at 18:40
    @LokiAstari Jeremy's answer already showed that effect, but also showed that (predictably) LTO defeats it too. – underscore_d Jun 13 '16 at 18:40
    @DavidSchwartz In this case it's about measuring the time `foo` takes to run, which the compiler is allowed to ignore when reordering, just like it's allowed to ignore observation from a different thread. – CodesInChaos Jun 14 '16 at 07:17
  • If I understand your answers correctly, reordering of statements can be avoided by putting the statements in functions in different compilation units. However, if LTO is on, this does not work either. Also, the existence of the proposal pointed at by Howard Hinnant – S2108887 Jun 14 '16 at 07:41
  • indicates there is no perfect way to avoid this issue. I am going to accept the answer of Jeremy. Thank you all for your help. :) – S2108887 Jun 14 '16 at 07:49
  • @CodesInChaos There is not required to be any such thing as "the time foo takes to run". The compiler is free to rearrange the code such that foo's work is done wherever and whenever it thinks best, interspersing other work that might need to done however it pleases. You will need compiler-specific knowledge to ensure that "the time foo takes to run" even is a thing that could be measured. – David Schwartz Jun 14 '16 at 16:40
  • @S2108887 read the question and accepted answer in [this thread](http://stackoverflow.com/questions/12631856/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc) for further insights – Steve Lorimer Jun 14 '16 at 18:46
    Volatile function pointers are your friends. Also, if you're profiling, use a tool like Valgrind. – lorro Jun 28 '16 at 07:26
  • `foo()` could take `t1` as a parameter, then return `0.0` that's added to `t2`. – lorro Jun 30 '16 at 13:59
  • I guess memory barriers are what you are looking for ? http://en.cppreference.com/w/cpp/atomic/memory_order http://stackoverflow.com/questions/7346163/efficient-memory-barriers – user2346536 Jul 01 '16 at 07:00

7 Answers


I'd like to try to provide a somewhat more comprehensive answer after this was discussed with the C++ standards committee. In addition to being a member of the C++ committee, I'm also a developer on the LLVM and Clang compilers.

Fundamentally, there is no way to use a barrier or some operation in the sequence to prevent these transformations. The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation. It can simulate them, it knows they cannot be observed by correct programs, and it is always free to move them around.

We could try to prevent this, but it would have extremely negative results and would ultimately fail.

First, the only way to prevent this in the compiler is to tell it that all of these basic operations are observable. The problem is that this would then preclude the overwhelming majority of compiler optimizations. Inside the compiler, we have essentially no good mechanisms to model that the timing is observable but nothing else. We don't even have a good model of which operations take time. As an example, does converting a 32-bit unsigned integer to a 64-bit unsigned integer take time? It takes zero time on x86-64, but on other architectures it takes non-zero time. There is no generically correct answer here.

But even if we succeed through some heroics at preventing the compiler from reordering these operations, there is no guarantee this will be enough. Consider a valid and conforming way to execute your C++ program on an x86 machine: DynamoRIO. This is a system that dynamically evaluates the machine code of the program. One thing it can do is online optimizations, and it is even capable of speculatively executing the entire range of basic arithmetic instructions outside of the timing. And this behavior isn't unique to dynamic evaluators; the actual x86 CPU will also speculate (a much smaller number of) instructions and reorder them dynamically.

The essential realization is that the fact that arithmetic isn't observable (even at the timing level) is something that permeates the layers of the computer. It is true for the compiler, the runtime, and often even the hardware. Forcing it to be observable would dramatically constrain the compiler, but it would also dramatically constrain the hardware.

But all of this should not cause you to lose hope. When you want to time the execution of basic mathematical operations, we have well-studied techniques that work reliably. Typically these are used when doing micro-benchmarking. I gave a talk about this at CppCon 2015: https://youtu.be/nXaxk27zwlk

The techniques shown there are also provided by various micro-benchmark libraries such as Google's: https://github.com/google/benchmark#preventing-optimization

The key to these techniques is to focus on the data. You make the input to the computation opaque to the optimizer and the result of the computation opaque to the optimizer. Once you've done that, you can time it reliably. Let's look at a realistic version of the example in the original question, but with the definition of foo fully visible to the implementation. I've also extracted a (non-portable) version of DoNotOptimize from the Google Benchmark library which you can find here: https://github.com/google/benchmark/blob/v1.0.0/include/benchmark/benchmark_api.h#L208

#include <chrono>

template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
  asm volatile("" : "+m"(const_cast<T &>(value)));
}

// The compiler has full knowledge of the implementation.
static int foo(int x) { return x * 2; }

auto time_foo() {
  using Clock = std::chrono::high_resolution_clock;

  auto input = 42;

  auto t1 = Clock::now();         // Statement 1
  DoNotOptimize(input);
  auto output = foo(input);       // Statement 2
  DoNotOptimize(output);
  auto t2 = Clock::now();         // Statement 3

  return t2 - t1;
}

Here we ensure that the input data and the output data are marked as un-optimizable around the computation foo, and only around those markers are the timings computed. Because you are using data to pincer the computation, it is guaranteed to stay between the two timings and yet the computation itself is allowed to be optimized. The resulting x86-64 assembly generated by a recent build of Clang/LLVM is:

% ./bin/clang++ -std=c++14 -c -S -o - so.cpp -O3
        .text
        .file   "so.cpp"
        .globl  _Z8time_foov
        .p2align        4, 0x90
        .type   _Z8time_foov,@function
_Z8time_foov:                           # @_Z8time_foov
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbx
.Ltmp0:
        .cfi_def_cfa_offset 16
        subq    $16, %rsp
.Ltmp1:
        .cfi_def_cfa_offset 32
.Ltmp2:
        .cfi_offset %rbx, -16
        movl    $42, 8(%rsp)
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, %rbx
        #APP
        #NO_APP
        movl    8(%rsp), %eax
        addl    %eax, %eax              # This is "foo"!
        movl    %eax, 12(%rsp)
        #APP
        #NO_APP
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        subq    %rbx, %rax
        addq    $16, %rsp
        popq    %rbx
        retq
.Lfunc_end0:
        .size   _Z8time_foov, .Lfunc_end0-_Z8time_foov
        .cfi_endproc


        .ident  "clang version 3.9.0 (trunk 273389) (llvm/trunk 273380)"
        .section        ".note.GNU-stack","",@progbits

Here you can see the compiler optimizing the call to foo(input) down to a single instruction, addl %eax, %eax, but without moving it outside of the timing or eliminating it entirely despite the constant input.

Hope this helps, and the C++ standards committee is looking at the possibility of standardizing APIs similar to DoNotOptimize here.

admo
Chandler Carruth
    Thank you for your answer. I have marked it as the new best answer. I could have done this earlier, but I have not read this stackoverflow page for many months. I am very interested in using the Clang compiler to make C++ programs. Among other things, I like that one can use Unicode characters in variable names in Clang. I think I will ask more questions about Clang on Stackoverflow. – S2108887 Dec 10 '16 at 09:38
    While I understand how this prevents foo being optimized away completely, can you elaborate a bit why this prevents the calls to `Clock::now()` being reordered relative to foo()? Does the optimzer have to assume that `DoNotOptimize` and `Clock::now()` have access to and might modify some common global state which in turn would tie them to the in- and output? Or are you relying on some current limitations of the optimizer's implementation? – MikeMB Feb 22 '17 at 11:43
    `DoNotOptimize` in this example is a synthetically "observable" event. It is as if it notionally printed visible output to some terminal with the input's representation. Since reading the clock is also observable (you're observing time passing) they cannot be re-ordered without changing the observable behavior of the program. – Chandler Carruth Mar 26 '17 at 02:38
  • I'm still not quite clear on the concept of "observable": if the `foo` function does something like reading from a socket, which may block for a while, does this count as an observable operation? And since `read` is not a "totally known" operation (right?), will the code stay in order? –  Sep 18 '18 at 08:39
  • "The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation." But it seems to me that the issue isn't the semantics of integer addition, it's the semantics of calling function foo(). Unless foo() is in the same compilation unit, how does it know that foo() and clock() don't interact? – Dave Feb 20 '19 at 15:10
  • @MikeMB until someone comes with a precise quote from the C++ standard, I would assert that the first reordering cannot happen by passing t1 as a new input parameter to DoNotOptimize. But I'm not sure how to prevent the reordering of the second one since Clock::now() takes no inputs! I think I would move it into a `__attribute__ ((noinline))` function that takes an input. Or maybe call `::now` from assembly, but that is hard: https://stackoverflow.com/questions/37502841/calling-printf-in-extended-inline-asm/37503773#37503773 – Ciro Santilli OurBigBook.com Jun 26 '19 at 15:59
  • OK, I went ahead and created a paranoid example with full data dependencies which I'm pretty sure cannot be reordered: https://stackoverflow.com/questions/37786547/enforcing-statement-order-in-c/56865717#56865717 – Ciro Santilli OurBigBook.com Jul 03 '19 at 08:54
  • The only problem that's left with `DoNotOptimize()` is, that the input needs to be reloaded from memory and the output needs to be written to memory within the critical region. For a simple one-cycle-latency instruction like `addl`, that's significant overhead. Just like the calling of the timer function is significant overhead. I think, the only way to eliminate these overheads from your measurement is to time a loop containing the operation of interest (with the input and output properly `DoNotOptimize()`ed). – cmaster - reinstate monica Nov 08 '19 at 10:00
  • `DoNotOptimize` works cause volatile access enforces strict sequentiality. – ABaumstumpf Mar 05 '20 at 07:40
  • @cmaster-reinstatemonica: Even a call via volatile-qualified function pointer need not always enforce ordering. A compiler could eagerly perform some computations and store results in a private place before reading the `volatile`-qualified pointer, check whether the pointer identifies a particular function the compiler has optimized for this particular call site, and either call the ordinary version of function identified by the pointer or invoke a version which is optimized for the particular call site, and which exploits the eagerly-performed computations. – supercat Mar 23 '21 at 22:07
  • @supercat The point about `DoNotOptimize()` is that it declares the memory behind its argument as being changed by the assembler code. As such, the compiler is forced to a) reload the input value from memory after the call, b) to not let any computation within the critical section rely on a stale version of the input, and c) to write the result to memory before the second `DoNotOptimize()` call. All computations that rely on input and are relevant to output must be sequenced between these two calls. If you fail to include I/O variables in the `DoNotOptimize()` calls, that's your mistake. – cmaster - reinstate monica Mar 23 '21 at 23:51
  • @cmaster-reinstatemonica: Would anything forbid a "clever" compiler from transforming `int x = operationThatProduces5(); DoNotOptimize(x); doSomething(x);` into `int x = operationThatProduces5(); DoNotOptimize(x); if (x == 5) doSomething(5); else doSomething(x);`? – supercat Aug 04 '21 at 21:21
  • @supercat As long as the `x` in `if(x == 5)` is reloaded from memory after the `DoNotOptimize(x)` call, the compiler is free to do any kind of insane stuff to perform the `doSomething(x)` call. So, yes, it would be possible for the CPU to speculatively execute `doSomething(5)` before `x` has been reloaded from memory. However, the compiler has no reason to believe that `x == 5` would remain true after the `DoNotOptimize(x)` call, and will not introduce the danger of a pipeline flush due to misprediction of `if(x == 5)` if it is any good. – cmaster - reinstate monica Aug 05 '21 at 05:06
  • @cmaster-reinstatemonica: In many use cases, after a sequence like the above, the value of `x` would be 5 more than 25% of the time, if not more than 50%. Thus, having a compiler evaluate the cost of a hard-coded x==5 variant and perform the comparison/substitution if it would be 80% or more cheaper than the general-case version could be a very sensible optimization if the compiler didn't use the text of the Standard as an process code in a way contrary to the intention of the Standard's authors as documented in the Rationale. Fundamentally, the Standard was never designed to draw... – supercat Aug 05 '21 at 14:52
  • ...a clear line between constructs that all implementations must support and those that will never be used by non-broken programs, because no single partitioning could be suitable for more than a small fraction of the purposes to which C is put. While some compiler maintainers seem to need bright-line rules, such rules could only be written sensibly if the Standard weakened a couple of long-standing principles, most notably: 1. Replace the idea that all meaningfully-conforming programs should be expected to run meaningfully on all implementations with a requirement that... – supercat Aug 05 '21 at 14:58
  • ...conforming implementations *must reject* any meaningfully-conforming programs that they cannot otherwise process meaningfully, so constructs that are needed for some tasks but not others could be used in meaningfully-conforming programs to accomplish those tasks, but implementations not intended for those tasks need not support them. 2. Use an abstraction model in which certain optimizations may have observable effects; a program whose actions make such effects observable would be a correct program if all allowable combinations of effects would yield behavior meeting requirements. – supercat Aug 05 '21 at 15:03
  • @supercat What does the standard have to do with that? Afaik, the contents of an `asm` statement is not standardized in any way. However, the compilers do describe its syntax and semantics, especially of the `"+m"` directive: It means that the compiler will consider the memory behind the pointer to be both input and output of the (empty) assembler code, and as such, the compiler must assume that the value of `x` does change. To the compiler, the chance of `x` remaining the same is only `2^-64`, assuming a 64 bit value. It should definitely no try to optimize that. – cmaster - reinstate monica Aug 05 '21 at 15:16
  • If e.g. code uses a `float*` to store a value, and then uses a compiler-defined barrier, and then uses a `uint32_t*` to read the bits of the storage, the Standard would not recognize any effect the barrier might have in preventing the read from invoking Undefined Behavior. If, however, the Standard were to replace N1570 6.5p7 with a rule that instead said that a read of a `uint32_t*` may *generally* be regarded as unsequenced relative to a preceding or following write of type `float*`, then the effects of the barrier would be clear even if the Standard knew nothing about it. – supercat Aug 05 '21 at 15:25
  • @cmaster-reinstatemonica: Further, although a compiler wouldn't be entitled to ignore the possibility of `x` changing in your barrier scenario, a compiler that processes `x = 5; barrier(); doSomething(x);` as `x = 5; barrier(); if (x==5) doSomething(5); else doSomething(x);` would not be making a correctness-threatening assumption that `x` won't change, but rather a performance-affecting assumption that `x` won't always change, possibly generating code that will perform worse than straightforward code if `x` changes more than 75% of the time, but better if it doesn't. – supercat Aug 05 '21 at 15:48
  • @supercat Ah, now I see why you see the standard relevant in this. As to the "optimization" that the compiler is allowed to do: Yes, the compiler has every right to emit the `if(x == 5)` code you gave. Nevertheless, it would be a pretty daft thing to do: The whole point point of the `asm` statement part after the `:` is to allow the savy programmer to specify what the compiler can assume, and what it cannot assume. This syntax is not used all over the place (= low optimization necessity), and is used by pretty hard-core low-level programmers who can read and use specs (= best treated as gods). – cmaster - reinstate monica Aug 05 '21 at 19:58
  • @cmaster-reinstatemonica: Many of the "optimizations" that programmers complain about today would have been universally recognized as downright obtuse when the Standard is written, and thus *there was no need to have the Standard forbid them*. Compiler writers, however, often seem to care much more about optimizations than their users do, and regard the fact that a construct would block an optimization the they want to perform as a being problem with the construct, rather than being *the reason it was used in the first place*. – supercat Aug 05 '21 at 20:07

Summary:

There seems to be no guaranteed way to prevent reordering, but as long as link-time/full-program optimisation is not enabled, locating the called function in a separate compilation unit seems a fairly good bet. (At least with GCC, although logic would suggest that this is likely with other compilers too.) This comes at the cost of the function call - inlined code is by definition in the same compilation unit and open to reordering.

Original answer:

GCC reorders the calls under -O2 optimisation:

#include <chrono>
static int foo(int x)    // 'static' or not here doesn't affect ordering.
{
    return x*2;
}
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

GCC 5.3.0:

g++ -S --std=c++11 -O0 fred.cpp :

_ZL3fooi:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %ecx, 16(%rbp)
        movl    16(%rbp), %eax
        addl    %eax, %eax
        popq    %rbp
        ret
_Z4fredi:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $64, %rsp
        movl    %ecx, 16(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -16(%rbp)
        movl    16(%rbp), %ecx
        call    _ZL3fooi
        movl    %eax, -4(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -32(%rbp)
        movl    -4(%rbp), %eax
        addq    $64, %rsp
        popq    %rbp
        ret

But:

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        call    _ZNSt6chrono3_V212system_clock3nowEv
        leal    (%rbx,%rbx), %eax
        addq    $32, %rsp
        popq    %rbx
        ret

Now, with foo() as an extern function:

#include <chrono>
int foo(int x);
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %ecx
        call    _Z3fooi
        movl    %eax, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %eax
        addq    $32, %rsp
        popq    %rbx
        ret

BUT, if this is linked with -flto (link-time optimisation):

0000000100401710 <main>:
   100401710:   53                      push   %rbx
   100401711:   48 83 ec 20             sub    $0x20,%rsp
   100401715:   89 cb                   mov    %ecx,%ebx
   100401717:   e8 e4 ff ff ff          callq  100401700 <__main>
   10040171c:   e8 bf f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401721:   e8 ba f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401726:   8d 04 1b                lea    (%rbx,%rbx,1),%eax
   100401729:   48 83 c4 20             add    $0x20,%rsp
   10040172d:   5b                      pop    %rbx
   10040172e:   c3                      retq
Jeremy
    So does MSVC and ICC. Clang is the only one that seems to preserve the original sequence. – Cody Gray - on strike Jun 13 '16 at 12:37
  • With asm volatile("" ::: "memory"); does it prevent the ordering? – paulm Jun 13 '16 at 13:13
  • +1 for a good example of reordering. But I wish you added another sentence about how to prevent that from happening. Can we chop this up into several small functions, and can we assume one function finishes before the next one starts even if it's an inline function, or do we need ["volatile"](https://en.wikipedia.org/wiki/volatile_variable) or ["std::atomic_thread_fence"](http://en.cppreference.com/w/cpp/atomic/atomic_thread_fence) to really be sure statements occur in a particular order? – David Cary Jun 13 '16 at 13:24
  • With the code more as presented in the OP, gcc seems to do the right thing; https://godbolt.org/g/C93rRK – Niall Jun 13 '16 at 13:24
  • @Niall - As long as foo() is in a separate compilation unit this seems to be the case. If foo() is defined in the same compilation unit the reordering still happens - presumably because in this latter case the compiler can determine that foo() is not affected by any potential side effects of now(), whereas in the former case it can't. – Jeremy Jun 13 '16 at 13:38
  • I agree with David Cary that this is a good example and I would also like to see information about ways to avoid reordering. – S2108887 Jun 13 '16 at 13:46
    you don't use t1 and t2 anywhere so it may think the result can be discarded and reorder the code – phuclv Jun 13 '16 at 13:49
  • Even if the time results are used, the reordering happens in the presence of a definition of the function between the two time calls... https://godbolt.org/g/EmvZMy – Niall Jun 13 '16 at 13:51
    @Niall - I can't offer anything more concrete, but I think my comment alludes to the underlying reason: The compiler knows that foo() cannot affect now(), nor vice versa, and so does the reordering. Various experiments involving extern scope functions and data seem to confirm this. This includes having static foo() depend on a file-scope variable N - if N is declared as static, reordering occurs, whereas if it's declared non-static (i.e. it's visible to other compilation units, and hence potentially subject to side effects of extern functions such as now()) reordering does not occur. – Jeremy Jun 13 '16 at 13:54
    @ Lưu Vĩnh Phúc: Except that the calls themselves are not elided. Once again, I suspect this is because the compiler doesn't know what their side effects might be - but it _does_ know that those side effects cannot influence the behaviour of foo(). – Jeremy Jun 13 '16 at 13:57
    And a final note: specifying -flto (link-time optimisation) causes reordering even in otherwise non-reordered cases. – Jeremy Jun 13 '16 at 14:01
  • A declaration of an integer, and a call to some `foo()`, are not in any way the same thing. – Lightness Races in Orbit Jun 13 '16 at 14:35
  • @Lightness - Agreed, hence my subsequent comments. – Jeremy Jun 13 '16 at 14:45
    Modified my example to match the question. Results and conclusions (most of which were based on this modified code anyway) remain valid. – Jeremy Jun 13 '16 at 14:58
  • `auto fred(volatile int x) { auto t1 = std::chrono::high_resolution_clock::now(); volatile int y = foo(x); auto t2 = std::chrono::high_resolution_clock::now(); return t2-t1; }` seems not to become reordered. – BartekChom Jun 13 '16 at 20:35
  • Sorry, in C++11 we need `decltype(std::chrono::high_resolution_clock::now()-std::chrono::high_resolution_clock::now()) fred(volatile int x)` instead of `auto fred(volatile int x)`. – BartekChom Jun 13 '16 at 21:07
  • Reordering of assembly often occurs with the higher optimization settings, and is often done by analyzing memory/register access independent of the original source code. As well, some sets of less-optimal instructions can be replaced by a more optimal set of instructions. These are sometimes called 'peephole optimizations', because the optimizer is only looking at that small section of instructions in an effort to find patterns that can be optimized. – Chris Steele Jun 13 '16 at 23:32
    How does this answer the question? – Kieren Johnstone Jun 14 '16 at 08:55
  • Should be noted though, that no matter what the compiler does or does not, there is also a processor reordering which can also have surprising effects. That is less visible with Intel processors than with ARM processors. – SomeWittyUsername Jun 25 '16 at 06:57

Reordering may be done by the compiler, or by the processor.

Most compilers offer a platform-specific method to prevent reordering of read-write instructions. On gcc, this is

asm volatile("" ::: "memory");

(More information here)

Note that this prevents reordering only indirectly: operations are kept in order only insofar as they depend on the reads/writes that the barrier constrains.

In practice I haven't yet seen a system where the system call in Clock::now() does have the same effect as such a barrier. You could inspect the resulting assembly to be sure.

It is not uncommon, however, for the function under test to be evaluated at compile time. To enforce "realistic" execution, you may need to derive the input for foo() from I/O or a volatile read.


Another option would be to disable inlining for foo() - again, this is compiler specific and usually not portable, but would have the same effect.

On gcc, this would be __attribute__ ((noinline))


@Ruslan brings up a fundamental issue: How realistic is this measurement?

Execution time is affected by many factors: among them the actual hardware we are running on, and concurrent access to shared resources like cache, memory, disk and CPU cores.

So what we usually do to get comparable timings: make sure they are reproducible with a low error margin. This makes them somewhat artificial.

"hot cache" vs. "cold cache" execution performance can easily differ by an order of magnitude - but in reality, it will be something inbetween ("lukewarm"?)

peterchen
    Your hack with `asm` affects execution time of the statements between timer calls: the code after memory clobber has to reload all variables from memory. – Ruslan Jun 13 '16 at 11:07
  • @Ruslan: Their hack, not mine. There are different levels of purging, and doing something like that is unavoidable for reproducible results. – peterchen Jun 13 '16 at 13:03
    Note that the hack with 'asm' only helps as a barrier for operations which touch memory, and the OP is interested in more than that. See my answer for more details. – Chandler Carruth Jun 27 '16 at 20:01

The C++ language defines what is observable in a number of ways.

If foo() does nothing observable, then it can be eliminated completely. If foo() only does a computation that stores values in "local" state (be it on the stack or in an object somewhere), and the compiler can prove that no safely-derived pointer can get into the Clock::now() code, then there are no observable consequences to moving the Clock::now() calls.

If foo() interacted with a file or the display, and the compiler cannot prove that Clock::now() does not interact with the file or the display, then reordering cannot be done, because interaction with a file or display is observable behavior.

While you can use compiler-specific hacks to force code not to move around (like inline assembly), another approach is to attempt to outsmart your compiler.

Create a dynamically loaded library. Load it prior to the code in question.

That library exposes one thing:

namespace details {
  void execute( void(*)(void*), void *);
}

and wraps it like this:

template<class F>
void execute( F f ) {
  struct bundle_t {
    F f;
  } bundle = {std::forward<F>(f)};

  auto tmp_f = [](void* ptr)->void {
    auto* pb = static_cast<bundle_t*>(ptr);
    (pb->f)();
  };
  details::execute( tmp_f, &bundle );
}

which packs up a nullary lambda and uses the dynamic library to run it in a context that the compiler cannot understand.

Inside the dynamic library, we do:

void details::execute( void(*f)(void*), void *p) {
  f(p);
}

which is pretty simple.

Now, for the compiler to reorder the calls across execute, it would have to understand the dynamic library, which it cannot do while compiling your test code.

It can still eliminate foo()s with zero side effects, but you win some, you lose some.

Yakk - Adam Nevraumont
    *"another approach is to attempt to outsmart your compiler"* If that phrase isn't a sign of having gone down the rabbit hole, I don't know what is. :-) – Cody Gray - on strike Jun 13 '16 at 14:54
  • 1
    I think it might be helpful to note that *the time required for a block of code to execute is not considered an "observable" behavior which compilers are required to maintain*. If the time to execute a block of code were "observable", then no forms of performance optimization would be permissible. While it would be helpful for C and C++ to define a "causality barrier" which would require a compiler to hold off on executing any code after the barrier until all side-effects from before the barrier had been handled by the generated code [code which wants to ensure that data has fully... – supercat Jun 13 '16 at 18:12
  • 1
    ...propagated through hardware caches would need to use hardware-specific means to do that, but a hardware-specific means of waiting until all posted writes were complete would be useless without a barrier directive to ensure that all pending writes tracked by the compiler must get posted to the hardware before the hardware is asked to ensure that all posted writes are complete.] I know of no way of doing that in either language without using a dummy `volatile` access or call to outside code. – supercat Jun 13 '16 at 18:15
3

No, it can't. According to the C++ standard [intro.execution]:

14 Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.

A full-expression is basically a statement terminated by a semicolon. As you can see, the above rule stipulates that statements must be executed in order. It is within statements that the compiler is allowed more free rein (i.e. it is under some circumstances allowed to evaluate the expressions that make up a statement in orders other than left-to-right, or anything else specific).

Note that the conditions for the as-if rule to apply are not met here. It is unreasonable to think that any compiler would be able to prove that reordering calls to get the system time would not affect observable program behaviour. If there were a circumstance in which two calls to get the time could be reordered without changing observed behaviour, it would be extremely inefficient to actually produce a compiler that analyses a program with enough understanding to infer this with certainty.

Toby Speight
  • 27,591
  • 48
  • 66
  • 103
Smeeheey
  • 9,906
  • 23
  • 39
  • 12
    There's still the as-if rule though – M.M Jun 13 '16 at 09:58
  • 18
    By the _as-if rule_ the compiler can do anything to code as long as it does not change observable behavior. Time of execution is not observable. So it can reorder arbitrary lines of code as long as the result would be the same (most compilers do the sensible thing and do not reorder time calls, but it is not required) – Revolver_Ocelot Jun 13 '16 at 10:00
  • 6
    *Time of execution is not observable.* This is quite strange. From a practical, non-technical point of view, time of execution (a.k.a. "performance") is *very* observable. – Frédéric Hamidi Jun 13 '16 at 10:06
  • 2
    @FrédéricHamidi: But it's not a metric by which C++ defines program behaviour. – Lightness Races in Orbit Jun 13 '16 at 10:06
  • 3
    Depends on how you measure time. It is not possible to measure the number of clock cycles taken to execute some body of code in standard C++. – Peter Jun 13 '16 at 10:07
  • 2
    Problem is that the "as if" rule means that this answer doesn't always address the question. The compiler is allowed to reorder statements in order to optimise execution time, as long as the observable output of the program (e.g. to files) doesn't change. – Peter Jun 13 '16 at 10:08
  • I don't buy the reasoning for the "as-if" rule -- the same logic would imply that the compiler can't invoke it to do any optimization *between* two calls to the clock either, or really, to do any optimizations at all. –  Jun 13 '16 at 10:24
  • @Hurkyl No, I would put it differently - any optimisation between clock calls stays between clock calls. – Smeeheey Jun 13 '16 at 10:29
  • 2
    [Execution time is not observable](http://stackoverflow.com/a/36814227/3410396) so _"observable behaviour clearly would be affected"_ is wrong. In addition I would expect, say Microsoft compiler compiling for Win system to have an extensive knowledge about system calls and be able to optimise them, so previous paragraph is questionable too. – Revolver_Ocelot Jun 13 '16 at 10:37
  • @Revolver_Ocelot - fine, I have to agree that I haven't justified the statement "observable behaviour would be affected" enough, so I removed it. However, the rest of the point still holds. I would be very surprised if any compiler allowed itself this optimisation. – Smeeheey Jun 13 '16 at 10:46
  • Note I'm not claiming it is theoretically impossible, I'm talking about a practical point here. – Smeeheey Jun 13 '16 at 10:46
  • 1
    I agree about impracticality of such behavior. I was talking from language lawyer position. – Revolver_Ocelot Jun 13 '16 at 10:49
  • @revolver It does not. Microsoft's compiler assumes nothing about calls to Win32 API functions. It is a bit smarter about *library* functions, providing intrinsics and inlining code, etc., but GCC and Clang both do exactly the same thing there. It could not possibly do so, otherwise a compiler would be tied to a particular version of the operating system. – Cody Gray - on strike Jun 13 '16 at 10:50
  • @Cody Microsoft compilers *are* tied to a particular version of operating system. For example, you cannot generate a Win16 application with the latest VS. Worse yet, you probably will have a hard time making the latest VS generate something that would run on NT 4. Hell, I think they already dropped XP, and that is NT 5. – dbanet Jun 13 '16 at 12:58
  • 3
    @dba You're mixing a few things together. The linker can no longer generate Win16 applications, that's true enough, but that's because they have removed support for generating that type of binary. WIn16 apps don't use the PE format. That doesn't imply that either the compiler or linker has special knowledge about the API functions. The other issue is related to the runtime library. There is absolutely no problem getting the latest version of MSVC to generate a binary that runs on NT 4. I have done it. The problem comes as soon as you try to link in the CRT, which calls functions not available. – Cody Gray - on strike Jun 13 '16 at 13:01
  • @Cody doesn't the CRT get linked dynamically before execution on the target platform? Anyway, I was not claiming what I just said necessarily implies MSVC assumes something about the calls to Win32, I am just showing the contrary to the consequent is not implied by what you said, by counterexample. – dbanet Jun 13 '16 at 13:12
  • @dba The CRT can be linked either statically or dynamically, but that is not relevant. The problem is that there are different versions of the CRT, and newer versions of the CRT call Windows API functions that are not available on older versions of the OS. Like the version of the CRT bundled with VS 2010 calls `EncodePointer`, which doesn't exist on Win 2K. If you linked the CRT statically, then it would fail because `EncodePointer` doesn't exist. If you linked it dynamically, then it would fail because the CRT DLL is missing. What I'm saying is that you have not provided a counterexample. – Cody Gray - on strike Jun 13 '16 at 13:14
  • If you use a different version of the CRT, one that supports Win NT 4, then it will work fine, no matter which version of the compiler that you compile it with. Or if you stub out `EncodePointer`, providing your own implementation of it. From the compiler's perspective, the CRT, the Windows API, and your own functions are all black boxes. It has no special knowledge about what they do or what they mean. They're just functions that conform to a particular ABI. – Cody Gray - on strike Jun 13 '16 at 13:16
  • @Cody you have now just produced an elaboration on my counterexample to the belief that (especially Microsoft's) compilers are not tied to operating system versions (like, come on, MSVC is limited to *one* flavour of NT (well, not true anymore since they have Windows Phone and Surface and so on, which are ARM)). You cannot prove ``MSVC assumes nothing about calls to Win32 API function`` from a belief for which a valid counterexample exists. – dbanet Jun 13 '16 at 13:21
  • 2
    @dba I guess I don't understand what you are saying. I still feel like you are mixing together multiple concepts. The linker is designed to generate a single type of binary, a binary format supported only by Windows NT. Same for the x86 instruction set. But that has nothing to do with having special knowledge about API functions and optimizing accordingly. It does not do that. The runtime library uses API functions, but so can your code. The compiler doesn't care; it makes no distinction between your code and the CRT's code. The statement of mine that you quoted is still absolutely valid. – Cody Gray - on strike Jun 13 '16 at 13:26
  • @Cody it's sad you don't understand what am I saying and keep digging into the semantics. I'm not saying what I quoted is wrong, I'm saying its premise is wrong. *Additionally* I think @​revolver's original expectation of compilers made exclusively for a platform being able to figure stuff about that platform out is reasonable and sound, at least theoretically, if not even practically, as previously shown. – dbanet Jun 13 '16 at 13:47
  • 1
    @dbanet I'm not sure either what you're trying to argue. You clearly stated "Microsoft compilers are tied to a particular version of operating system". This is provably wrong, since you can easily generate binaries with VS2015 that will run on five different operating systems (and VS2008 can generate perfectly fine binaries for Win10 - clearly without a time machine that'd be rather complicated to pull off). – Voo Jun 13 '16 at 17:29
  • 1
    If your claim is that MSVC has hardcoded information about API calls in it.. extraordinary claims require extraordinary proof. I mean think about the maintenance hassle that would imply and the little advantage it would bring (the kernel and the compiler are written by two completely different teams at MS, they don't even know each other for the most part). Different versions of MSVC have different intrinsics and behavior and the compiler is generally intimately linked to the runtime (you can get around that though) but that's about it. – Voo Jun 13 '16 at 17:32
  • 1
    +1 for being the only answer so far to refer to the standard (whether or not the interpretation is correct!). – PJTraill Jun 16 '16 at 08:22
  • @Voo please re-read my comments starting with [the first one](http://stackoverflow.com/questions/37786547/enforcing-statement-order-in-c/37786733?noredirect=1#comment63046888_37786733) which precisely addresses your issue. – dbanet Jun 17 '16 at 23:50
  • On your second comment: no, that is not my claim, please, again, [read carefully](http://stackoverflow.com/questions/37786547/enforcing-statement-order-in-c/37786733?noredirect=1#comment63049051_37786733). I am not stating anything I have not stated before, but here it still goes: @Cody's claim was X. He tried to show X by showing it follows from Y. But that does not work, because Y is not true. I'm not claiming anything about X, I'm only claiming you cannot show X using Y, because Y itself doesn't hold. – dbanet Jun 17 '16 at 23:56
  • 1
    Not sure why I needed to get pinged back into this discussion, but if *Y* is the initial claim that you linked to, I'm pretty sure I have already conclusively disproven that. None of Microsoft's compilers are tied to a specific OS version. The *linker* is tied to a specific executable format/specification, which happens to only be supported by (32-bit and later) Windows NT, but that has nothing to do with the compiler and requires absolutely no internal knowledge of the OS's workings, since this is an open standard. And the *runtime library* makes OS-specific calls, but so too does user code. – Cody Gray - on strike Jun 19 '16 at 08:33
1

No.

Sometimes, by the "as-if" rule, statements may be re-ordered. This is not merely because they are logically independent of each other, but because that independence is what allows such a re-ordering to occur without changing the semantics of the program.

Moving a system call that obtains the current time obviously does not satisfy that condition. A compiler that knowingly or unknowingly does so is non-compliant and really silly.

In general, I wouldn't expect any expression that results in a system call to be "second-guessed" by even an aggressively optimizing compiler. It just doesn't know enough about what that system call does.

Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
  • 5
    I agree that it would be silly, but I wold not call it _non-conformant_. Compiler can have knowledge what system call on concrete system exactly does and if it has side effects. I would expect compilers to not reorder such call just to cover common use case, allowing for better user experience, not because standard prohibits it. – Revolver_Ocelot Jun 13 '16 at 10:12
  • Might want to note that it is possible to reduce the odds of the compiler reordering statements. For example, calling a function that is defined in another compilation reduces the odds of statements before the function call being shifted so they are executed after - or vice versa. Similarly with inline assembler (although I've seen mention that gcc is smart enough to understand inline assembler in some cases). The only way to be sure, however, is to examine code emitted by the compiler. – Peter Jun 13 '16 at 10:15
  • @Peter gcc never analyzes inline assembly code. It can only do some optimizations based on data in extended asm syntax — clobber list, inputs, outputs etc. – Ruslan Jun 13 '16 at 11:05
  • 4
    @Revolver_Ocelot: Optimisations that change the semantics of the program (okay, save for copy elision) are non-compliant to the standard, whether you agree or not. – Lightness Races in Orbit Jun 13 '16 at 11:31
  • 6
    In the trivial case of `int x = 0; clock(); x = y*2; clock();` there are *no* defined ways for the `clock()` code to interact with the state of `x`. Under the C++ standard, it doesn't have to know what `clock()` does -- it could examine the stack (and notice when the computation occurs), but *that isn't C++'s problem*. – Yakk - Adam Nevraumont Jun 13 '16 at 14:33
  • 5
    To take Yakk's point further: it's true that re-ordering the system calls, so that the result of the first is assigned to `t2` and the second to `t1`, would be non-conforming and silly if those values are used, what this answer misses is that a conforming compiler can sometimes re-order other code across a system call. In this case, provided it knows what `foo()` does (for example because it has inlined it) and hence that (loosely speaking) it's a pure function then it can move it around. – Steve Jessop Jun 13 '16 at 22:38
  • 1
    .. again loosely speaking, this is because there's no guarantee that the actual implementation (albeit not the abstract machine) won't speculatively calculate `y*y` before the system call, just for fun. There is also no guarantee that the actual implementation won't use the result of this speculative calculation later at whatever point `x` is used, therefore doing nothing between the calls to `clock()`. The same goes for whatever an inlined function `foo` does, provided it has no side-effects and cannot depend on state that might be altered by `clock()`. – Steve Jessop Jun 13 '16 at 22:45
  • Can you back up _Moving a system call that obtains the current time obviously does not satisfy that condition_ (i.e. without changing semantics) _. A compiler that … does so is non-compliant … ._ with a quote? It all depends what the standard means by “semantics” (!) (if it uses that term); I agree that time should be considered observable but the problem is, I think, that `foo()` means “make the results defined according to the standard by the implementation of `foo` available to ‘subsequent’ code” — and results defined according to the standard tend not to include elapsing of time. – PJTraill Jun 16 '16 at 08:31
0

noinline function + inline assembly black box + full data dependencies

This is based on https://stackoverflow.com/a/38025837/895245 but because I didn't see any clear justification of why the ::now() cannot be reordered there, I would rather be paranoid and put it inside a noinline function together with the asm.

This way I'm pretty sure the reordering cannot happen, since noinline "ties" together the ::now() call and the data dependency.

main.cpp

#include <chrono>
#include <iostream>
#include <string>

// noinline ensures that the ::now() cannot be split from the __asm__
template <class T>
__attribute__((noinline)) auto get_clock(T& value) {
    // Make the compiler think we actually use / modify the value.
    // It can't "see" what is going on inside the assembly string.
    __asm__ __volatile__ ("" : "+g" (value));
    return std::chrono::high_resolution_clock::now();
}

template <class T>
static T foo(T niters) {
    T result = 42;
    for (T i = 0; i < niters; ++i) {
        result = (result * result) - (3 * result) + 1;
    }
    return result;
}

int main(int argc, char **argv) {
    unsigned long long input;
    if (argc > 1) {
        input = std::stoull(argv[1], NULL, 0);
    } else {
        input = 1;
    }

    // Must come before because it could modify input
    // which is passed as a reference.
    auto t1 = get_clock(input);
    auto output = foo(input);
    // Must come after as it could use the output.
    auto t2 = get_clock(output);
    std::cout << "output " << output << std::endl;
    std::cout << "time (ns) "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
              << std::endl;
}

GitHub upstream.

Compile and run:

g++ -ggdb3 -O3 -std=c++14 -Wall -Wextra -pedantic -o main.out main.cpp
./main.out 1000
./main.out 10000
./main.out 100000

The only minor downside of this method is that we add one extra callq instruction over an inline method. objdump -CD shows that main contains:

    11b5:       e8 26 03 00 00          callq  14e0 <auto get_clock<unsigned long long>(unsigned long long&)>
    11ba:       48 8b 34 24             mov    (%rsp),%rsi
    11be:       48 89 c5                mov    %rax,%rbp
    11c1:       b8 2a 00 00 00          mov    $0x2a,%eax
    11c6:       48 85 f6                test   %rsi,%rsi
    11c9:       74 1a                   je     11e5 <main+0x65>
    11cb:       31 d2                   xor    %edx,%edx
    11cd:       0f 1f 00                nopl   (%rax)
    11d0:       48 8d 48 fd             lea    -0x3(%rax),%rcx
    11d4:       48 83 c2 01             add    $0x1,%rdx
    11d8:       48 0f af c1             imul   %rcx,%rax
    11dc:       48 83 c0 01             add    $0x1,%rax
    11e0:       48 39 d6                cmp    %rdx,%rsi
    11e3:       75 eb                   jne    11d0 <main+0x50>
    11e5:       48 89 df                mov    %rbx,%rdi
    11e8:       48 89 44 24 08          mov    %rax,0x8(%rsp)
    11ed:       e8 ee 02 00 00          callq  14e0 <auto get_clock<unsigned long long>(unsigned long long&)>

so we see that foo was inlined, but the get_clock calls were not; they were kept outside and surround it.

get_clock itself, however, is extremely efficient, consisting of a single tail-call-optimized jump instruction that doesn't even touch the stack:

00000000000014e0 <auto get_clock<unsigned long long>(unsigned long long&)>:
    14e0:       e9 5b fb ff ff          jmpq   1040 <std::chrono::_V2::system_clock::now()@plt>

Since the clock precision is itself limited, I think it is unlikely that you will be able to notice the timing effects of the one extra jmpq. Note that one call is required regardless, since ::now() is in a shared library.

Call ::now() from inline assembly with a data dependency

This would be the most efficient solution possible, overcoming even the extra jmpq mentioned above.

This is unfortunately extremely hard to do correctly as shown at: Calling printf in extended inline ASM

If your time measurement can be done directly in inline assembly without a call however, then this technique can be used. This is the case for example for gem5 magic instrumentation instructions, x86 RDTSC (not sure if this is representative anymore) and possibly other performance counters.

Related threads:

Tested with GCC 8.3.0, Ubuntu 19.04.

Ciro Santilli OurBigBook.com
  • 347,512
  • 102
  • 1,199
  • 985
  • 1
    You normally don't need to force a spill/reload with `"+m"`, using `"+r"` is a much more efficient way to make the compiler materialize a value and then assume the variable has changed. – Peter Cordes Oct 24 '19 at 11:59