
I would like to ensure that the calculations requested are executed exactly in the order I specify, without any alterations from either the compiler or CPU (including the linker, assembler, and anything else you can think of).


Operator left-to-right associativity is assumed in the C language

I am working in C (possibly also interested in C++ solutions). The C standard specifies left-to-right associativity for operators of equal precedence, and hence
a = b + c - d + e + f - g ...;
is equivalent to
a = (...(((((b + c) - d) + e) + f) - g) ...);

A small example

However, consider the following example:

double a, b = -2, c = -3;
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;

So many opportunities for optimisation

Many compilers and pre-processors may be clever enough to recognise that the "+ 2 - 2" is redundant and optimise it away. Similarly, they could recognise that the "+= 2*b" followed by the "+= c" can be written as a single FMA. Even if they don't substitute an FMA, they may switch the order of these operations, etc. Furthermore, if the compiler doesn't do any of these optimisations, the CPU may well decide to do some out-of-order execution and decide it can do the "+= c" before the "+= 2*b", etc.

As floating-point arithmetic is non-associative, each type of optimisation may result in a different end result, which may be noticeable if a snippet like the above is inlined somewhere.
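
(As a toy illustration of the non-associativity, unrelated to my actual code, the two groupings of the same three terms below print different values:)

#include <stdio.h>

int main(void)
{
    double x = (1e20 - 1e20) + 3.14;   /* left-to-right grouping: prints 3.14 */
    double y = 1e20 + (3.14 - 1e20);   /* regrouped: 3.14 is lost when rounded into -1e20, prints 0 */
    printf("%g vs %g\n", x, y);
    return 0;
}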

Why worry about floating point associativity?

For most of my code I would like as much optimisation as I can have and don't care about floating-point associativity or bit-wise reproducibility, but occasionally there is a small snippet (similar to the above example) which I would like to be untampered with and totally respected. This is because I am working with a mathematical method which requires an exactly reproducible result.

What can I do to resolve this?

A few ideas which have come to mind:

  • Disable compiler optimisations and out of order execution
    • I don't want this, as I want the other 99% of my code to be heavily optimised. (This seems to be cutting off my nose to spite my face). I also most likely won't have permission to change my hardware settings.
  • Use a pragma
  • Write some assembly
    • The code snippets are small enough that this might be reasonable, although I'm not very confident in this, especially if (when) it comes to debugging.
  • Put this in a separate file, compile separately as un-optimised as possible, and then link using a function call
  • Volatile variables
    • To my mind these are just for ensuring that memory access is respected and un-optimised, but perhaps they might prove useful.
  • Access everything through judicious use of pointers
    • Perhaps, but this seems like a disaster in readability, performance, and bugs waiting to happen.

If anyone can think of any feasible solutions (either from any of the ideas I've suggested or otherwise) that would be ideal. The "pragma" option or "function call" to my mind seem like the best approaches; a rough sketch of the pragma idea is given below.
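
For example, the pragma idea might look something like this (GCC syntax; the pragmas below do exist, but I haven't verified how much they actually guarantee about floating-point rewriting, and the function name is just a stand-in for my snippet):

#pragma GCC push_options
#pragma GCC optimize ("O0")   /* GCC extension: compile the following function un-optimised */
double protected_sum(double b, double c)
{
    double a = 1 + 2 - 2 + 3 + 4;
    a += 2*b;
    a += c;
    return a;
}
#pragma GCC pop_options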

The ultimate goal

To have something that marks off a small chunk of simple and largely vanilla C code as protected and untouchable by any (realistically, most) optimisations, while allowing the rest of the code to be heavily optimised, covering optimisations from both the CPU and the compiler.

oliversm
    Why is there a C++ flag if this is for C? They are very different languages, although the solutions here will be compiler specific. – Matthieu Brucher Feb 15 '19 at 16:49
    @FrançoisAndrieux it can, the Intel compiler is known to optimize heavily and change the associativity if it makes the code faster according to its heuristics. I suspect the other do as well with fast math active. – Matthieu Brucher Feb 15 '19 at 16:50
  • @FrançoisAndrieux in case of undefined behaviour, optimization can change the end result – Jean-François Fabre Feb 15 '19 at 16:51
    @FrançoisAndrieux It most certainly can. As the most obvious counter-example, consider code that measures its own performance. His issue is that the standards permit a variety of results and he wants to choose a particular one of them. – David Schwartz Feb 15 '19 at 16:52
    @MatthieuBrucher The first sentence: "I am working in C (possibly also interested in C++ solutions)". So the OP is looking for answers to both. – jamesdlin Feb 15 '19 at 16:52
    you can look at `-fp-model precise` icc option – Iłya Bursov Feb 15 '19 at 16:52
    Do you have an example of code that doesn't behave as you expect? If so post it, along with your compiler name / version / settings. – dbush Feb 15 '19 at 16:53
    Not sure if the compiler can/will optimize inline assembly but if it doesn't that seems like something you could use. – NathanOliver Feb 15 '19 at 16:54
    I think it is trivial to "Disable compiler optimisations" for the 1% of the code, yet compile the other 99% at full optimization. Did you have some problem when you tried this? – 2785528 Feb 15 '19 at 16:55
    @IłyaBursov actually `consistent` instead of `precise` may be even better (especially for FMA). – Matthieu Brucher Feb 15 '19 at 16:58
    @FrançoisAndrieux if you want maximum performance, that's what you want the compiler to do. But that's a choice between IEEE compliancy and performance. Usually, you want a tradeoff, because pure IEEE is going to be dead slow. – Matthieu Brucher Feb 15 '19 at 17:00
  • @FrançoisAndrieux: Optimization may not change **fully defined** observable behavior. If the behavior is not defined at all by the C or C++ standard (whichever is applicable), then optimization is free to change it. If the behavior is partially defined, so that multiple specific behaviors are permitted by the standard regardless of whether or not optimization is enabled, then optimization may change the observable behavior among these options. Both C and C++ allow latitude in the evaluation of floating-point expressions, so optimization has wiggle room to change the observable behavior. – Eric Postpischil Feb 15 '19 at 17:16
  • @EricPostpischil and others, seems I hadn't considered how relaxed the standard is regarding floating point types. – François Andrieux Feb 15 '19 at 17:18
  • If your compiler does not allow associativity-based optimizations, I think that the Kahan summation algorithm would slightly improve the reproducibility. – Hiroki Feb 15 '19 at 18:15

3 Answers


This is not a complete answer, but it is informative, partially answers, and is too long for a comment.

Clarifying the Goal

The question actually seeks reproducibility of floating-point results, not order of execution. Also, order of execution is irrelevant; we do not care if, in (a+b)+(c+d), a+b or c+d is executed first. We care that the result of a+b is added to the result of c+d, without any reassociation or other rewriting of arithmetic unless the result is known to be the same.

Reproducibility of floating-point arithmetic is in general an unsolved technological problem. (There is no theoretical barrier; we have reproducible elementary operations. Reproducibility is a matter of what hardware and software vendors have provided and how hard it is to express the computations we want performed.)

Do you want reproducibility on one platform (e.g., always using the same version of the same math library)? Does your code use any math library routines like sin or log? Do you want reproducibility across different platforms? With multithreading? Across changes of compiler version?

Addressing Some Specific Issues

The samples shown in the question can largely be handled by writing each individual floating-point operation in its own statement, as by replacing:

a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;

with:

double t0, t1;   /* temporaries of the nominal type */
t0 = 1 + 2;
t0 = t0 - 2;
t0 = t0 + 3;
t0 = t0 + 4;
t1 = 2*b;
t0 += t1;
a = t0 + c;

The basis for this is that both C and C++ permit an implementation to use “excess precision” when evaluating an expression but require that precision to be “discarded” when an assignment or cast is performed. Limiting each assignment expression to one operation or executing a cast after each operation effectively isolates the operations.
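
For completeness, the cast form of the same workaround looks like this (a sketch; each cast forces the result of the enclosed operation back to the nominal double format, which the rule above requires to discard any excess precision):

a = (double)((double)((double)((double)(1 + 2) - 2) + 3) + 4);
a = (double)(a + (double)(2*b));
a = (double)(a + c);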

In many cases, a compiler will then generate code using instructions of the nominal type, instead of instructions using a type with excess precision. In particular, this should avoid a fused multiply-add (FMA) being substituted for a multiplication followed by an addition. (An FMA has effectively infinite precision in the product before it is added to the addend, thus falling under the “excess precision is permitted” rule.)

There are caveats, however. An implementation might first evaluate an operation with excess precision and then round it to the nominal precision. In general, this can cause a different result than doing a single operation in the nominal precision. For the elementary operations of addition, subtraction, multiplication, division, and even square root, this does not happen if the excess precision is sufficiently greater than the nominal precision. (There are proofs that a result with sufficient excess precision is always close enough to the infinitely precise result that the rounding to nominal precision gets the same result.) This is true for the case where the nominal precision is the IEEE-754 basic 32-bit binary floating-point format and the excess precision is the 64-bit format. However, it is not true where the nominal precision is the 64-bit format and the excess precision is Intel’s 80-bit format.

So, whether this workaround works depends on the platform.
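
For the FMA part specifically, the FP_CONTRACT pragma from standard C is relevant (a sketch; not every compiler honours the pragma, and some only provide an equivalent command-line switch such as -ffp-contract=off; the function below is purely illustrative):

#pragma STDC FP_CONTRACT OFF   /* standard C pragma: forbid contracting a multiply and an add into one FMA */

double no_fma(double a, double b, double c)
{
    double t = 2*b;            /* one operation per statement, as above */
    return (a + t) + c;        /* with contraction off, the multiply and the adds must stay separate operations */
}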

Other Issues

Aside from the use of excess precision and features like FMA or the optimizer rewriting expressions, there are other things that affect reproducibility, such as non-standard treatment of subnormals (notably replacing them with zeroes) and variations between math library routines (sin, log, and similar functions return different results on different platforms; nobody has fully implemented correctly rounded math library routines with known bounded performance).

These are discussed in other Stack Overflow questions about floating-point reproducibility, as well as papers, specifications, and standards documents.

Irrelevant Issues

The order in which a processor executes floating-point operations is irrelevant. Processor reordering of calculations obeys rigid semantics; the results are identical regardless of the chronological order of execution. (Processor timing can affect results if, for example, a task is partitioned into subtasks, such as assigning multiple threads or processes to process different parts of the arrays. Among other issues, their results could arrive in different orders, and the process receiving their results might then add or otherwise combine their results in different orders.)

Using pointers will not fix anything. As far as C or C++ is concerned, *p where p is a pointer to double is the same as a where a is a double. One of the objects has a name (a) and one of them does not, but they are like roses: they smell the same. (There are issues where, if you have some other pointer q, the compiler might not know whether *q and *p refer to the same thing. But that also holds true for *q and a.)

Using volatile qualifiers will not aid in reproducibility regarding the excess precision or expression rewriting issue. That is because only an object (not a value) is volatile, which means it has no effect until you write it or read it. But, if you write it, you are using an assignment expression1, so the rule about discarding excess precision already applies. When reading the object, you would force the compiler to retrieve the actual value from memory, but this value will not be any different than the non-volatile object has after assignment, so nothing is accomplished.

Footnote

1 I would have to check on other things that modify an object, such as ++, but those are likely not significant for this discussion.

Eric Postpischil

Write this critical chunk of code in assembly language.

The situation you're in is unusual. Most of the time people want the compiler to do optimizations, so compiler developers don't spend much development effort on means to avoid them. Even with the knobs you do get (pragmas, separate compilation, indirections, ...) you can never be sure something won't be optimized. Some of the undesirable optimizations you mention (constant folding, for instance) cannot be turned off by any means in modern compilers.

If you use assembly language you can be sure you're getting exactly what you wrote. If you do it any other way you won't have that level of confidence.
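
Short of writing the whole routine in assembly, a related trick (a GCC/Clang extension, not part of this answer's recommendation) is to place empty inline-asm statements between the operations as optimisation barriers; a rough sketch:

double t;

t = 1 + 2;
__asm__ __volatile__("" : "+m"(t));   /* empty asm: the compiler must assume t is read and possibly modified here */
t = t - 2;
__asm__ __volatile__("" : "+m"(t));   /* so it cannot fold or reassociate across these points */
t = t + 3;
__asm__ __volatile__("" : "+m"(t));
t = t + 4;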

zwol

"clever enough to recognise the + 2 - 2 is redundant and optimise this away"

No! All decent compilers will apply constant propagation, figure out that a is constant, and optimise all your statements away into something equivalent to a = 1;. Here is the example with assembly.

Now if you make `a` volatile, the compiler has to assume that any change to `a` could have an impact outside the C++ program. Constant propagation will still be performed to optimise each of these calculations, but the intermediary assignments are guaranteed to happen. Here is the example with assembly.
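
Roughly, the volatile variant looks like this (a sketch; arithmetic on the non-volatile b and c can still be constant-folded):

volatile double a;
double b = -2, c = -3;

a = 1 + 2 - 2 + 3 + 4;   /* still folded to a constant, but the store to a must happen */
a += 2*b;                /* each read-modify-write of the volatile a is preserved */
a += c;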

If you don't want constant propagation to happen, you need to deactivate optimisations. In this case, the best approach would be to keep this code separate, so that you can compile the rest with all optimisations on.
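
A sketch of what keeping the code separate can look like (the attribute and flag names are GCC/Clang-style and should be checked against your toolchain):

/* critical.c -- compiled on its own, e.g. with -O0 -ffp-contract=off,
   while the rest of the project keeps its usual -O2/-O3 flags. */
__attribute__((noinline))   /* keep the call from being inlined into optimised callers */
double critical_sum(double b, double c)
{
    double a = 1 + 2 - 2 + 3 + 4;
    a += 2*b;
    a += c;
    return a;
}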

However, this is not ideal. The optimizer could outperform you, and with this approach you'll lose global optimisation across function boundaries.

Recommendation/quote of the day:

Don't diddle code; Find better algorithms
- B. W. Kernighan & P. J. Plauger

Christophe
    "Find better algorithms" seems a little aggressive, since what the OP appears to actually want is just a IEEE-conforming result. "I am working with a mathematical method which exactly requires a reproducible result." (Also, presumably, a _precise_ reproducible result.) – Mooing Duck Feb 15 '19 at 17:50
  • @MooingDuck Thanks for giving the opportunity to clarify ! I didn't intend to be aggressive in any ways. Better doesn't imply that the current algo is bad. However I wanted to raise awareness about the fact that tinkering with code (even going assembler for numerical processing) might not be the best thing to do (portability, maintainability). Floating point arithmetic is discrete by nature and the algorithm shall take this into account. If needed, there's a header about numeric limits, in order to allow algorithms to be aware of the degree of imprecision of the calculations. – Christophe Feb 15 '19 at 18:15
    I don't think this question is about the discrete-ness of floating point, but about how every compiler violates the IEEE spec to make faster code, which results in them returning the _wrong discrete value_. See `/fp:strict` for MSVC and `-msse2` for GCC. https://learn.microsoft.com/en-us/cpp/build/reference/fp-specify-floating-point-behavior?view=vs-2017 https://stackoverflow.com/questions/7295861/enabling-strict-floating-point-mode-in-gcc – Mooing Duck Feb 15 '19 at 18:24