How can floating point calculations be made deterministic?

Question

Floating point calculation is neither associative nor distributive on processors. So,

(a + b) + c is not equal to a + (b + c)

and a * (b + c) is not equal to a * b + a * c

Is there any way to perform deterministic floating point calculation that do not give different results. It would be deterministic on uniprocessor ofcourse, but it would not be deterministic in multithreaded programs if threads add to a sum for example, as there might be different interleavings of the threads.

So my question is, how can one achieve deterministic results for floating point calculations in multithreaded programs?

Good question. Although the answer is probably "you can't" or "use arbitrary precision arithmetic", it is legitimate to ask it. — Alexandre C., Sep 09 '11 at 18:18
Also, what kind of field do you need this for ? There are disciplines where this is a real issue (computational geometry for instance) and others for which there is no problem with floating point computations as they are (most fields actually, with some workarounds where this really matters). — Alexandre C., Sep 09 '11 at 18:19
"`a + (b*c)` is not equal to `a*b + a*c`" -- are they ever equal? — Austin Salonen, Sep 09 '11 at 18:20
Alexander, I need it for deterministic execution, that is if I run my program several times, it gives the same output. This eases debugging. — MetallicPriest, Sep 09 '11 at 18:20
Any solution to this problem will either incur a dramatic performance penalty, or will be very cumbersome to implement. Are you sure you cannot get away with some roundoff error ? — Alexandre C., Sep 09 '11 at 18:21
but you execute the same code.. so what's the difference? I guess the order of data.. can you force a specific order, like first sorting the data set? — Karoly Horvath, Sep 09 '11 at 18:23
If your problem allows it, you could use rational numbers (integral numerator and denominator), which would always be precise. But you cannot use those in irrational functions, of course. — Kerrek SB, Sep 09 '11 at 18:52

Stephen Canon · Accepted Answer · 2011-09-09T19:04:41.933

Floating-point is deterministic. The same floating-point operations, run on the same hardware, always produces the same result. There is no black magic, noise, randomness, fuzzing, or any of the other things that people commonly attribute to floating-point. The tooth fairy does not show up, take the low bits of your result, and leave a quarter under your pillow.

Now, that said, certain blocked algorithms that are commonly used for large-scale parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.

What can you do about it?

First, make sure that you actually can't live with the situation. Many things that you might try to enforce ordering in a parallel computation will hurt performance. That's just how it is.

I would also note that although blocked algorithms may introduce some amount of non-determinism, they frequently deliver results with smaller rounding errors than do naive unblocked serial algorithms (surprising but true!). If you can live with the errors produced by a naive serial algorithm, you can probably live with the errors of a parallel blocked algorithm.

Now, if you really, truly, need exact reproducibility across runs, here are a few suggestions that tend not to adversely affect performance too much:

Don't use multithreaded algorithms that can reorder floating-point computations. Problem solved. This doesn't mean you can't use multithreaded algorithms at all, merely that you need to ensure that each individual result is only touched by a single thread between synchronization points. Note that this can actually improve performance on some architectures if done properly, by reducing D$ contention between cores.
In reduction operations, you can have each thread store its result to an indexed location in an array, wait for all threads to finish, the accumulate the elements of the array in order. This adds a small amount of memory overhead, but is generally pretty tolerable, especially when the number of threads is "small".
Find ways to hoist the parallelism. Instead of computing 24 matrix multiplications, each one of which uses parallel algorithms, compute 24 matrix products in parallel, each one of which uses a serial algorithm. This, too, can be beneficial for performance (sometimes enormously so).

There are lots of other ways to handle this. They all require thought and care. Parallel programming usually does.

It's only deterministic at the machine code level though. A given C program compiled twice may give different results. — Antimony, Sep 02 '12 at 18:53
@Antimony I am quite confused, it seems you and Stephen are saying different things. So a+b+c is or is not the same as b+c+a? — Herman Toothrot, Sep 06 '16 at 23:20
@HermanToothrot `a+b+c` is not the same as `b+c+a` regardless of determinism, results may differ even on the same processor. It's because floating-point arithmetic is not associative, one cannot reorder operands. Point is, even `a+b+c` parsed as `(a+b)+c` in C can de-facto yield different results even with the exact same inputs, bitwise. — yeputons, Jul 16 '23 at 21:52

R.. GitHub STOP HELPING ICE · Answer 2 · 2011-09-09T19:17:54.890

3

Edit: I've removed my old answer since I seem to have misunderstood OP's question. If you want to see it you can read the edit history.

I think the ideal solution would be to switch to having a separate accumulator for each thread. This avoids all locking, which should make a drastic difference to performance. You can simply sum the accumulators at the end of the whole operation.

Alternatively, if you insist on using a single accumulator, one solution is to use "fixed-point" rather than floating point. This can be done with floating-point types by including a giant "bias" term in your accumulator to lock the exponent at a fixed value. For example if you know the accumulator will never exceed 2^32, you can start the accumulator at 0x1p32. This will lock you at 32 bits of precision to the left of the radix point, and 20 bits of fractional precision (assuming double). If that's not enough precision, you could us a smaller bias (assuming the accumulator will not grow too large) or switch to long double. If long double is 80-bit extended format, a bias of 2^32 would give 31 bits of fractional precision.

Then, whenever you want to actually "use" the value of the accumulator, simply subtract out the bias term.

edited Sep 09 '11 at 19:17

answered Sep 09 '11 at 18:54

R.. GitHub STOP HELPING ICE

208,859
35
376
711

He's talking about scenarios like this: you have two threads, each of which adds a floating-point value to the same accumulator. If you don't control the order in which the additions occur, you might get `(accumulator + thread A value) + thread B value` or `(accumulator + thread B value) + thread A value`, which may not be equal due to rounding. – Stephen Canon Sep 09 '11 at 19:00
OK, the question was not very clear then. Anyway, it sounds to me like each thread should just have its own accumulator then, and they can all be added together at the end. – R.. GitHub STOP HELPING ICE Sep 09 '11 at 19:06
That would also avoid the locking on the accumulator, which is almost surely making the whole process an order of magnitude slower than if it were just single-threaded... – R.. GitHub STOP HELPING ICE Sep 09 '11 at 19:10
@Stephen: I've replaced my answer with some new ideas. – R.. GitHub STOP HELPING ICE Sep 09 '11 at 19:18

score 2 · Answer 3 · edited Mar 15 '21 at 04:37

2

Even using a high-precision fixed point datatype would not solve the problem of making the results for said equations determinisic (except in certain cases). As Keith Thompson pointed out in a comment, 1/3 is a trivial counter-example of a value that cannot be stored correctly in either a standard base-10 or base-2 floating point representation (regardless of precision or memory used).

One solution that, depending upon particular needs, may address this issue (it still has limits) is to use a Rational number data-type (one that stores both a numerator and denominator). Keith suggested GMP as one such library:

GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating point numbers. There is no practical limit to the precision...

Whether it is suitable (or adequate) for this task is another story...

Happy coding.

edited Mar 15 '21 at 04:37

phuclv

37,963
15
156
475

answered Sep 09 '11 at 18:35

3

Inexactness of 1.0/3.0 has nothing to do with non-determinism. – R.. GitHub STOP HELPING ICE Sep 09 '11 at 19:53
2

It has everything to do with it - if not for that, then it wouldn't matter if you have 1/3+1/3+1/3 or (1+1+1)/3, so a non-deterministic choice between those two methods would not make the outcome non-deterministic. – Random832 Sep 09 '11 at 20:53
@R.. Caught me! :) I used it there to show that not all numbers can be stored in a finite space using standard encoding. Since there is no division in the questions it is less useful then it could be as an example, however. Assuming a bounded precision it's still easy to come up with examples when the order of FP operations matters over `+` and `*`. – Sep 09 '11 at 20:53
I was treating 1.0/3.0 as an atomic quantity not a division. – R.. GitHub STOP HELPING ICE Sep 09 '11 at 21:30
@R.. But it's not an *exact* atomic quantity in standard mantissa-exponent floating point or fixed precision floating point. If there was a division it would be easy to argue the case provided by Random8321: the intermediate values are simply *not adequately representable in that form with any sum of memory*. With just addition and multiplication I think the argument has to be dropped to "for a bound precision". – Sep 09 '11 at 21:59

score 1 · Answer 4 · answered Sep 09 '11 at 18:17

1

Use a decimal type or library supporting such a type.

answered Sep 09 '11 at 18:17

Burton Samograd

3,652
19
21

It's still not "deterministic". The order of operations, for any finite storage number, still matter. (However, using a fixed *high-precision* type can help.) – Sep 09 '11 at 18:20
pst, fixed high precision type? what do you mean by that? – MetallicPriest Sep 09 '11 at 18:24
@Metallic There is a level of precision that a finite set of memory can't store. – Sep 09 '11 at 18:29
A decimal type won't help if you need to compute 1.0/3.0. – Keith Thompson Sep 09 '11 at 18:29
@Keith Thompson But a Rational would ;-) – Sep 09 '11 at 18:29
@pst: Yes, if you store the numerator and denominator with sufficient precision. GMP might be a solution. – Keith Thompson Sep 09 '11 at 18:31
3

Decimal does nothing to address this issue. Non-associativity is not unique to binary floating-point formats. – Stephen Canon Sep 09 '11 at 19:03

Keith Thompson · Answer 5 · 2011-09-09T20:44:53.710

-2

Try storing each intermediate result in a volatile object:

volatile double a_plus_b = a + b;
volatile double a_plus_b_plus_c = a_plus_b + c;

This is likely to have nasty effects on performance. I suggest measuring both versions.

EDIT: The purpose of volatile is to inhibit optimizations that might affect the results even in a single-threaded environment, such as ~~changing the order of operations or~~ storing intermediate results in wider registers. It doesn't address multi-threading issues.

EDIT2: Something else to consider is that

A ﬂoating expression may be contracted, that is, evaluated as though it were an atomic operation, thereby omitting rounding errors implied by the source code and the expression evaluation method.

This can be inhibited by using

#include <math.h>
...
#pragma STDC FP_CONTRACT off

Reference: C99 standard (large PDF), sections 7.12.2 and 6.5 paragraph 8. This is C99-specific; some compilers might not support it.

edited Sep 09 '11 at 20:44

answered Sep 09 '11 at 18:29

Keith Thompson

254,901
44
429
631

What if I have multiple threads adding to a global sum (associative property)? That would create non-determinism as threads might add to the sum in different orders each time. – MetallicPriest Sep 09 '11 at 18:30
@MetallicPriest: Yes, that would indeed create non-determinism. Consider having the threads append their results to a list of some sort (with appropriate code to avoid messing up the data structure). When you're done, sort the list and sum it. – Keith Thompson Sep 09 '11 at 18:33
3

I can not see how `volatile` would help here. He is already doing lock() and unlock() which hopefully synchronize the memory. The problem is that different threads complete the work in different order and volatile does not change that. – snap Sep 09 '11 at 18:34
My thought was that even in a single-threaded environment, the compiler might perform optimizations that could alter the results; storing each intermediate result in a volatile object would inhibit those optimizations. For example, some systems have 64-bit double, but store *some* intermediate results in 80-bit registers. The use of `volatile` wasn't intended to fix the threading problem (I should have mentioned that). – Keith Thompson Sep 09 '11 at 19:21
1

The compiler cannot reorder floating-point operations that might change the result unless the user licenses it to do so, with or without the `volatile` keyword. – Stephen Canon Sep 09 '11 at 19:36
1

@Stephen: Good point. The order of evaluation of operands is not defined, but that just means that `x + y` can evaluate x then y, or y then x; it doesn't affect the result. The association of operands with operators is well defined. – Keith Thompson Sep 09 '11 at 20:46

score -4 · Answer 6 · answered Sep 09 '11 at 18:36

-4

Use packed decimal.

answered Sep 09 '11 at 18:36

EvilTeach

28,120
21
85
141

1

Decimal does nothing to address this issue. Non-associativity is not unique to binary floating-point formats. – Stephen Canon Sep 09 '11 at 19:03
Every floating-point format (even decimal formats) incurs rounding (the only alternative would be to use an unbounded number of digits). – Stephen Canon Sep 09 '11 at 21:32

How can floating point calculations be made deterministic?

6 Answers6

Linked