
I know that similar precision questions have been asked here, but I am reading code from a project that does an exact equality comparison between floats, and it is puzzling me.

Assume that x1 and x2 are of type numpy.ndarray with dtype np.float32. These two variables have been computed by the same code executed on the same data, but x1 was computed on one machine and x2 on another (this runs on an AWS cluster whose nodes communicate via MPI).

Then the values are compared as follows:

numpy.array_equal(x1, x2)

Hence, exact equality (no tolerance) is crucial for this program to work, and it does seem to work fine. This confuses me. How can one compare two np.float32 arrays computed on different machines and face no precision issues? When can two (or more) such floats be equal?
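As a minimal illustration (the compute function below is hypothetical, since the project's actual code is not shown), running the same fixed sequence of float32 operations on the same input data produces bitwise-identical arrays, so exact comparison succeeds:

```python
import numpy as np

def compute(data):
    # Any fixed sequence of IEEE-754 operations on float32 inputs.
    return np.float32(0.1) * data + np.float32(0.2)

data = np.arange(5, dtype=np.float32)
x1 = compute(data)          # "machine 1"
x2 = compute(data.copy())   # "machine 2": same code, same data

print(np.array_equal(x1, x2))  # True: same ops in the same order
```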

mgus
    Floating-point rounding error isn't random. It's sensitive to a lot of subtle things, including execution order, hardware, thread scheduling for parallel algorithms, compiler settings, use of vectorized instructions, etc., but it's not random. It's difficult, but possible, to get reproducible results. – user2357112 Jun 02 '20 at 07:27
  • A lot of stuff is already likely to be the same across machines in a cluster, but that doesn't mean your results will be reproducible after the next software update or hardware change, or that you'll get reproducible results if you introduce finer-grained parallelism. – user2357112 Jun 02 '20 at 07:31

1 Answer


The arithmetic specified by IEEE-754 is deterministic given certain constraints discussed in its clause 11 (2008 version), including suitable rules for expression evaluation (for example, an unambiguous translation from programming-language expressions to IEEE-754 operations: a+b+c must mean (a+b)+c, not a+(b+c)).
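To see why the evaluation order has to be pinned down, here is a small sketch with np.float32 scalars chosen so that the rounding difference is visible (1 is smaller than half an ulp of 1e8 in float32, so it is absorbed):

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

left = (a + b) + c   # 0.0 + 1.0 -> 1.0
right = a + (b + c)  # b + c rounds back to -1e8, so the sum is 0.0
print(left, right)   # 1.0 0.0
```

Both evaluations are individually deterministic; they simply denote different sequences of IEEE-754 operations, which is why the language must specify which one a+b+c means.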

If parallelism is not used, or is constructed suitably (such as always partitioning a job into the same pieces and combining their results in the same way regardless of the order in which the computations complete), then obtaining identical results is not surprising.
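A sketch of such a reproducible reduction (the partition count and helper name are illustrative, not from the question's project): the data is always split the same way, and the partial sums are always combined in index order, never in completion order.

```python
import numpy as np

def reduce_fixed_order(data, nparts=4):
    # Partition identically every run, sum each piece,
    # then combine partial sums in a fixed (index) order.
    parts = np.array_split(data, nparts)
    partials = [p.sum(dtype=np.float32) for p in parts]
    total = np.float32(0.0)
    for s in partials:  # index order, regardless of which worker finished first
        total = np.float32(total + s)
    return total

rng = np.random.default_rng(0)
data = rng.random(1000).astype(np.float32)
r1 = reduce_fixed_order(data)
r2 = reduce_fixed_order(data)
print(r1 == r2)  # True: identical partitioning and combine order
```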

Some factors that prevent reproducibility include varying parallelism, using different math libraries (with different implementations of functions such as pow), and using languages that are not strict about floating-point evaluation (such as permitting, but not requiring, extra precision).
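Conversely, merely changing the order of a reduction, as finer-grained or differently scheduled parallelism typically would, generally perturbs the low-order bits. A small sketch (the two results are usually close but need not be bitwise equal):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(10_000).astype(np.float32)

fwd = np.float32(0.0)
for v in data:           # accumulate in forward order
    fwd = np.float32(fwd + v)

rev = np.float32(0.0)
for v in data[::-1]:     # accumulate in reverse order
    rev = np.float32(rev + v)

print(fwd, rev, fwd == rev)
```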

Eric Postpischil