Should the Results of Trig Functions Be Hardware Dependent with the Same OS / Compiler?

Question

I have a simple C++ program (foo.cxx):

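#include <stdio.h>
#include <math.h>
#include <string.h>

int main()
{
    // unsigned long is 64 bits on x86-64 Linux, the same size as a double
    unsigned long bits;

    double ang2 = -0.23202523431296057;
    // Copy the bytes with memcpy; reading a double through a long* is undefined behavior
    memcpy(&bits, &ang2, sizeof bits);
    printf("The bits of ang2 are %lx\n", bits);

    double sin_ang2 = sin(ang2);
    printf("sin_ang2 is %0.17f\n", sin_ang2);
    memcpy(&bits, &sin_ang2, sizeof bits);
    printf("The bits of sin_ang2 are %lx\n", bits);
}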

I have two machines with different hardware, both running Ubuntu 20.04 with gcc 9.3.0. On both machines, I compile the above code with this command:

g++ -ffloat-store foo.cxx

On machine 1, the result of running the above program is:

The bits of ang2 are bfcdb300bc9c468a
sin_ang2 is -0.22994895724656178
The bits of sin_ang2 are bfcd6ef7a98fc7ce

On machine 2, the result of running the above program is:

The bits of ang2 are bfcdb300bc9c468a
sin_ang2 is -0.22994895724656181
The bits of sin_ang2 are bfcd6ef7a98fc7cf

Notice the slight difference in the results of calling sin() on these two machines. My question is whether or not this should be expected. I realize there are many nuances with floating point arithmetic that can lead to imprecise results, but is this an example of one? My understanding is that the -ffloat-store option to gcc could have helped deliver consistent results across machines, though it didn't seem to help here:

-ffloat-store

Do not store floating-point variables in registers, and inhibit other options that might change whether a floating-point value is taken from a register or memory.

This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.

The hardware for machine 1 (lscpu) is:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   36 bits physical, 48 bits virtual
CPU(s):                          4
...
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           58
Model name:                      Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
...
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtr
                                 r pge mca cmov pat pse36 clflush dts acpi mmx f
                                 xsr sse sse2 ss ht tm pbe syscall nx rdtscp lm 
                                 constant_tsc arch_perfmon pebs bts rep_good nop
                                 l xtopology nonstop_tsc cpuid aperfmperf pni pc
                                 lmulqdq dtes64 monitor ds_cpl vmx smx est tm2 s
                                 sse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic p
                                 opcnt tsc_deadline_timer aes xsave avx f16c rdr
                                 and lahf_lm cpuid_fault epb pti ssbd ibrs ibpb 
                                 stibp tpr_shadow vnmi flexpriority ept vpid fsg
                                 sbase smep erms xsaveopt dtherm ida arat pln pt
                                 s md_clear flush_l1d

And the hardware for machine 2 is:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          16
...
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
...
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtr
                                 r pge mca cmov pat pse36 clflush dts acpi mmx f
                                 xsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rd
                                 tscp lm constant_tsc art arch_perfmon pebs bts 
                                 rep_good nopl xtopology nonstop_tsc cpuid aperf
                                 mperf pni pclmulqdq dtes64 monitor ds_cpl vmx s
                                 mx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid s
                                 se4_1 sse4_2 x2apic movbe popcnt tsc_deadline_t
                                 imer aes xsave avx f16c rdrand lahf_lm abm 3dno
                                 wprefetch cpuid_fault epb invpcid_single ssbd i
                                 brs ibpb stibp ibrs_enhanced tpr_shadow vnmi fl
                                 expriority ept vpid ept_ad fsgsbase tsc_adjust 
                                 bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx
                                  smap clflushopt intel_pt xsaveopt xsavec xgetb
                                 v1 xsaves dtherm ida arat pln pts hwp hwp_notif
                                 y hwp_act_window hwp_epp md_clear flush_l1d arc
                                 h_capabilities

Any suggestions on ways to get consistent results across these two machines?

"Consistent results" and "floating point" are mutually exclusive. And, yes, different CPUs can certainly end up rounding results differently. — Sam Varshavchik, Jun 24 '21 at 16:37
If you need them to match, you can always switch to using signed integer arithmetic by storing the numbers on the left and the right of the decimal. E.g. 123.4567 stored as 123 and 4567. Not particularly efficient though. — Abstract, Jun 24 '21 at 16:46
Thanks to Thomas Matthews. I found this to be particularly enlightening: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#873 — BoulderPika, Jun 24 '21 at 17:06
Note that "the same" OS and Compiler means "the same version, but maybe compiled differently for the target hardware". Since it's different hardware, it's unlikely that they are the same binaries, i.e., they're not the same. — Pete Becker, Jun 24 '21 at 17:44
"*particularly enlightening*" You have found a very good source of knowledge. The bottom line is that with 64-bit floats (even though the CPU may use a wider format internally) you cannot trust anything beyond the 15th significant digit, regardless of OS, libraries, or hardware. Even on one machine, different executions may produce different low-order bits. — Ripi2, Jun 24 '21 at 18:14
@BoulderPika Which is more important: "get consistent results" or best result (sin_ang2 are bfcd6ef7a98fc7cf)? — chux - Reinstate Monica, Jun 24 '21 at 18:22
@Ripi2 is right, 15 digits is about all you can reasonably expect and you're displaying 17. See [Is floating point math broken?](https://stackoverflow.com/q/588004/5987) — Mark Ransom, Jun 24 '21 at 18:36
For unit tests on math stuff in the LibC I support, we always include a sigma value around expected answers as they will vary a bit. This is expected and normal from what I know. — Michael Dorgan, Jun 24 '21 at 18:41
BoulderPika, note: in this case, a better sine is bfcd6ef7a98fc7ce_82... if we had more digits. That answer is very nearly halfway between the two machines' results. Being at 51.0%, the larger answer is better. See [table maker's dilemma](https://en.wikipedia.org/wiki/Rounding#Table-maker's_dilemma). — chux - Reinstate Monica, Jun 24 '21 at 18:44
OT: Consider using the [`%a` format specifier](https://stackoverflow.com/questions/4826842/the-format-specifier-a-for-printf-in-c) instead of type punning (e.g. https://godbolt.org/z/vPE94zcW7). — Bob__, Jun 24 '21 at 19:31
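
The `%a` suggestion in the last comment can be sketched as follows. This is a minimal illustration, not code from the thread: `double_from_bits` and `to_hexfloat` are hypothetical helper names, and the bit patterns used in the note below are the two results reported in the question. `memcpy` moves the bits, which avoids reading a `double` through a pointer of another type.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: reinterpret a 64-bit pattern as a double.
// memcpy is used instead of a pointer cast to stay within the aliasing rules.
double double_from_bits(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Hypothetical helper: format a double exactly, using the C99 "%a"
// hexadecimal floating-point conversion (no decimal rounding involved).
std::string to_hexfloat(double d) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%a", d);
    return buf;
}
```

With glibc, `to_hexfloat(double_from_bits(0xbfcd6ef7a98fc7ceULL))` gives `-0x1.d6ef7a98fc7cep-3` for machine 1 and the machine 2 pattern gives `-0x1.d6ef7a98fc7cfp-3`, so the two results differ only in the last hexadecimal digit of the significand.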

Piotr Praszmo · Answer 1 · 2023-03-01T12:34:50.690

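Those CPUs do not differ enough to produce different results for the same instructions. The difference you are seeing comes from different sin implementations in libc. The implementation is picked at load time by the dynamic linker based on what your CPU supports (__sin_avx or __sin_fma). Your lscpu output fits this: machine 2 lists the fma flag, while machine 1 lists avx but not fma. A fused multiply-add rounds once where a separate multiply and add round twice, so the two variants can end up one bit apart in the last place.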
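
To see which of the two variants named above a given machine is likely to get, the CPU features can be queried from code. The sketch below is only a rough guess at the outcome: `guess_sin_variant` is a hypothetical helper, and its selection order (FMA first, then AVX) is an assumption made for illustration, not glibc's actual selector. `__builtin_cpu_supports` is a GCC/Clang builtin that exists only on x86, hence the preprocessor guard.

```cpp
#include <string>

// Hypothetical helper: map CPU features to the variant names mentioned in the
// answer. The order of the checks is an assumption, not taken from glibc.
std::string guess_sin_variant(bool has_avx, bool has_fma) {
    if (has_fma) return "__sin_fma";
    if (has_avx) return "__sin_avx";
    return "baseline";
}

// Apply the guess to the CPU this program is running on.
std::string guess_sin_variant_for_this_cpu() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // harmless here; required only in early-startup code
    return guess_sin_variant(__builtin_cpu_supports("avx") != 0,
                             __builtin_cpu_supports("fma") != 0);
#else
    return "unknown";  // the builtin is not available on this target
#endif
}
```

Feeding in the flags from the question, `guess_sin_variant(true, false)` corresponds to machine 1 (avx, no fma) and `guess_sin_variant(true, true)` to machine 2.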

There is no straightforward way to disable this; see the Stack Overflow question "Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record".
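
If the goal is tests that pass on both machines rather than bit-identical output, one option, in the spirit of the tolerance-based unit tests mentioned in the comments, is to compare results in units in the last place (ULPs). The following is a minimal sketch under that assumption; `double_from_bits`, `ordered_bits`, `ulp_distance`, and `almost_equal_ulps` are hypothetical names introduced here, and the sketch assumes 64-bit IEEE 754 doubles.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: reinterpret a 64-bit pattern as a double.
double double_from_bits(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Map a double to an integer that increases monotonically with its value.
// IEEE 754 doubles are sign-magnitude, so negative values must be flipped.
std::int64_t ordered_bits(double d) {
    std::int64_t i;
    std::memcpy(&i, &d, sizeof d);
    return i < 0 ? std::numeric_limits<std::int64_t>::min() - i : i;
}

// Number of representable doubles between a and b (0 if they are equal).
std::uint64_t ulp_distance(double a, double b) {
    const std::int64_t ia = ordered_bits(a);
    const std::int64_t ib = ordered_bits(b);
    // Unsigned subtraction avoids signed overflow for far-apart values.
    return ia > ib ? static_cast<std::uint64_t>(ia) - static_cast<std::uint64_t>(ib)
                   : static_cast<std::uint64_t>(ib) - static_cast<std::uint64_t>(ia);
}

// True if a and b are within max_ulps of each other; NaN never compares equal.
bool almost_equal_ulps(double a, double b, std::uint64_t max_ulps) {
    if (std::isnan(a) || std::isnan(b)) return false;
    return ulp_distance(a, b) <= max_ulps;
}
```

Applied to the two bit patterns from the question (`0xbfcd6ef7a98fc7ce` and `0xbfcd6ef7a98fc7cf`), `ulp_distance` returns 1, so a test written with `almost_equal_ulps(result, expected, 1)` would accept either machine's output.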
