
Is it possible that a Fortran code compiled in one Windows machine with Visual studio 2019 on a 2018 Intel processor gives a slightly different result when the exe is copied to another machine (that has a 2022 Intel processor)? Could you please list the possible causes of this behavior?

Millemila

1 Answer


Run-time dispatch to different versions of functions based on CPU instruction sets, and/or auto-parallelization to all cores, are likely candidates.

Some compiler optimizations pretend that FP math is associative. In reality, FP math is not quite associative; different groupings into temporaries introduce different rounding errors.

Different choices in how to find parallelism (like auto-vectorization) in loops such as a dot product or the sum of an array can lead to different rounding, which can make the result numerically more accurate, but different from the source order of operations¹. Such loops can only be vectorized by pretending FP math is associative.
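
For example, here's a minimal Fortran sketch of the effect (program and variable names are just for illustration; it assumes single precision, round-to-nearest, and a compiler that honours the grouping, e.g. /fp:precise):

```fortran
program assoc_demo
  implicit none
  real :: big, small
  big   = 1.0e8    ! exactly representable in single precision
  small = 1.0
  ! Same mathematical value, two groupings, two different results:
  print *, (big - big) + small   ! prints 1.0
  print *, big + (small - big)   ! prints 0.0: small is absorbed into -1.0e8
end program assoc_demo
```

A value-unsafe (fast-math) compiler is allowed to rewrite one grouping into the other, and that reassociation is exactly what auto-vectorizing a reduction relies on.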


Auto-parallelization (e.g. OpenMP) with a different number of cores could break a problem up in a different way, into different sub-totals. If your program uses all cores, that's a likely candidate.
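
As a rough sketch of what that looks like in Fortran (assuming OpenMP; the program name is just illustrative), the grouping of partial sums in the reduction below depends on how many threads run it:

```fortran
program omp_reduction_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), s
  integer :: i
  call random_number(a)
  s = 0.0
  ! Each thread keeps a private partial sum; the partial sums are combined
  ! at the end, so the grouping (and the rounding) depends on thread count.
  !$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + a(i)
  end do
  !$omp end parallel do
  print *, 'sum =', s, '  threads =', omp_get_max_threads()
end program omp_reduction_demo
```

Built with OpenMP enabled (e.g. ifort /Qopenmp or gfortran -fopenmp), the last few digits of the printed sum can change with the number of threads.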

Intel compilers can also make code that dispatches to different versions of a function depending on which SIMD instruction sets are available. So you could have an SSE4 version, an AVX2+FMA version, and an AVX-512 version (perhaps even using 512-bit vectors).

Different SIMD widths lead to a different number of accumulators, if the loop uses the same number of vector registers. So that's a different set of numbers getting added together into subtotals, e.g. for a dot product.
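
Here's a scalar Fortran sketch of that effect (the accumulator counts are illustrative, mimicking what different vector widths plus unrolling would produce; the names are hypothetical):

```fortran
program accumulator_demo
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n)
  call random_number(a)
  print *, ' 1 accumulator  :', strided_sum(a, 1)
  print *, ' 4 accumulators :', strided_sum(a, 4)    ! roughly one 128-bit vector
  print *, '16 accumulators :', strided_sum(a, 16)   ! roughly one 512-bit vector, or unrolled 256-bit
contains
  function strided_sum(x, nacc) result(s)
    real, intent(in)    :: x(:)
    integer, intent(in) :: nacc
    real    :: s, acc(nacc)
    integer :: i
    acc = 0.0
    do i = 1, size(x)
       acc(mod(i-1, nacc) + 1) = acc(mod(i-1, nacc) + 1) + x(i)
    end do
    s = sum(acc)    ! the final horizontal sum has its own rounding, too
  end function strided_sum
end program accumulator_demo
```

Each variant computes the same mathematical sum but groups the additions differently, so the printed values typically differ in the last few digits.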

Does only one of the processors have AVX-512? Or is one of them a Pentium or Celeron without AVX? If so, that's also a likely factor.

Runtime dispatch in library code like TBB or SVML could also be a factor, not just in code directly generated by the compiler.


Intel compiler FP-math options

Intel compilers default to -fp-model=fast. See the docs for the Intel Fortran and Intel C/C++ compilers (both for the classic compilers, not the LLVM-based oneAPI ones, although presumably those also turn on fast-math by default). The C++ compiler docs seem more detailed in their descriptions.

(The other mainstream compilers, like LLVM and GCC (gfortran), default to -fno-fast-math. But if you use OpenMP, you can let the compiler treat the sum or product or whatever in a specific reduction loop as associative.)

Specifically, the Intel default is fast=1; there's an even more aggressive level, -fp-model=fast=2, that's not on by default.

See also Intel's 2018 article Consistency of Floating-Point Results using the Intel® Compiler or Why doesn’t my application always give the same answer?, covering the kinds of value-unsafe optimization Intel's compilers do and how the various FP-model switches affect that. And/or slides from a 2008 talk of the same title.

Quoting some descriptions from that article:

  • -fp-model=consistent - not necessarily the same order of operations as the source, but the same on every machine. And run-to-run consistency when auto-parallelizing with OpenMP. Otherwise with dynamic scheduling of a reduction, the way subtotals are added could depend on which threads finished which work first.
  • -fp-model=precise - allows value-safe optimizations only (but still including contraction of a*b+c into an FMA; see the sketch after this list).
  • -fp-model=strict - enables access to the FPU environment (e.g. you can change the FP rounding mode and compiler optimizations will respect that possibility);
    disables floating-point contractions such as fused multiply-add (fma) instructions; implies “precise” and “except”.
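
About the FMA contraction that precise still allows: an FMA computes a*b+c with a single rounding instead of two, so whether the compiler emits one (which depends on the target CPU and on compiler options) can change the result all by itself. A tiny double-precision sketch (the volatile attribute is only there to discourage compile-time folding):

```fortran
program fma_contraction_demo
  implicit none
  double precision, volatile :: a, b, c, r
  a = 1.0d0 + 2.0d0**(-27)
  b = a
  c = -(a * b)     ! c holds the negated *rounded* product
  r = a * b + c    ! without FMA: round(a*b) - round(a*b) = 0.0
                   ! contracted to an FMA: exact a*b minus round(a*b),
                   ! i.e. the (nonzero) rounding error of the product
  print *, r
end program fma_contraction_demo
```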

This part probably doesn't lead to optimizations that vary across machines, but it's interesting:

The ANSI Fortran standard is less restrictive than the C standard: it requires the compiler to respect the order of evaluation specified by parentheses, but otherwise allows the compiler to reorder expressions as it sees fit. The Intel Fortran compiler has therefore implemented a corresponding switch, /assume:protect_parens (-assume protect_parens), that results in standard-conforming behavior for reassociation, with considerably less impact on performance than /fp:precise (-fp-model precise). This switch does not affect any value-unsafe optimizations other than reassociation.
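
A small illustration of that distinction (variable names are just for the example; single precision, round-to-nearest assumed):

```fortran
program protect_parens_demo
  implicit none
  real :: big, small
  big   = 1.0e8
  small = 1.0
  ! With /assume:protect_parens (or /fp:precise) the parentheses must be
  ! honoured, so this prints 0.0: small is absorbed when added to big.
  print *, (small + big) - big
  ! Without parentheses the standard lets the compiler reassociate, so a
  ! value-unsafe build may print 0.0 (source order) or 1.0
  ! (if rewritten as small + (big - big)).
  print *, small + big - big
end program protect_parens_demo
```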


Footnote 1: different isn't always worse, numerically

Floating-point math always has rounding error (except in rare cases, e.g. adding 5.0 + 3.0, or other cases where the mantissas have enough trailing zeros that no 1 bits have to get rounded away after shifting them to align the exponents). The number you'd get from doing FP math operations in source order with source-specified precision will have rounding error.

Using multiple accumulators (from vectorizing and unrolling a reduction) is usually better numerically, a step in the direction of pairwise summation; a simple sum += a[i] is the worst way to add an array if the elements are all or mostly positive and of uniform size. Adding a small number to a large number loses a lot of precision, so having 16 or 32 different buckets to sum into means the totals aren't quite as big until you add the buckets together. See also Simd matmul program gives different numerical results.

You can jokingly call this "wrong" because it makes it harder to verify / test that a program does exactly what you think it does, and because it's not what the source says to do.

But unless you're doing stuff like Kahan summation (error-compensation) or double-double (extended precision) or other numerical techniques that depend on precise FP rounding semantics, fast-math optimizations merely give you answers that are wrong in a different way than the source, and mathematically maybe less wrong.

(Unless it also introduces some approximations like rsqrtps + a Newton iteration instead of x / sqrt(y). That may only be on with fast=2, which isn't the default. But I'm not sure. Some optimizations also might not care about turning -0.0 into +0.0.)
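
For reference, here's a minimal sketch of the Kahan summation mentioned above (module and function names are just illustrative). It only works if the compiler preserves the source order of operations, e.g. with -fp-model=precise; under value-unsafe optimization the compensation term simplifies algebraically to zero:

```fortran
module kahan_mod
  implicit none
contains
  function kahan_sum(x) result(s)
    real, intent(in) :: x(:)
    real    :: s, c, y, t
    integer :: i
    s = 0.0
    c = 0.0               ! running compensation for lost low-order bits
    do i = 1, size(x)
       y = x(i) - c       ! apply the correction carried from the previous step
       t = s + y
       c = (t - s) - y    ! algebraically zero; actually the rounding error of t
       s = t
    end do
  end function kahan_sum
end module kahan_mod
```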

Peter Cordes
  • Intel Fortran also defaults to getting wrong answers quickly. – francescalus Oct 28 '22 at 09:22
  • @francescalus :) Some might argue a naughty comment. I am assuming you mean it has the annoying default of optimization being turned *on* – Ian Bush Oct 28 '22 at 10:38
  • @IanBush: not just optimization on, which is fine and doesn't change results (in correct programs, absent compiler bugs). But also that it defaults to something similar to GCC's `-ffast-math`, although maybe not quite as aggressive since there are two levels of fast, and the default is only `-fp-model=fast=1`. Of course, in some cases multiple accumulators give *better* answers than sequentially adding a series of numbers into one `sum += a[i] * b[i]` ; that's the worst way to do it for rounding error if the sequence is uniform and positive. See updated answer. – Peter Cordes Oct 28 '22 at 13:42
  • 1
    That whitepaper in particular is very informative for those interested to learn more. – Ross Oct 28 '22 at 18:24
  • @njuffa: I was mostly focusing on things important for vectorization, like reduction loops. I think usually using 16 instead of 32 accumulators for example (4 256-bit vectors instead of 4 128-bit vectors) for summing an array or dot-product, for example, won't make a *big* difference, or even vs. a single accumulator unless your data is a specific mix of + / - or different magnitudes. I could imagine more significant differences from re-associating within an expression if there are variables known to have very different magnitudes; yeah it can matter a lot in some cases. – Peter Cordes Oct 28 '22 at 21:55
  • Also, the OP reported their results were only slightly different between machines. Anyway, thanks for the comment, but I think if it's a big deal for your code that's usually because you're intentionally thinking about managing FP rounding error, like I mentioned with Kahan summation or similar techniques. Otherwise source order might only happen to be good, or might *not* be good. I think I'm going to leave my answer as it stands; folks who know anything about floating point rounding error should understand the implications, and I don't want to get into the weeds of numerical techniques. – Peter Cordes Oct 28 '22 at 22:00