When I run an FMA-optimized Horner-scheme polynomial evaluation (a cosine approximation), it produces an average error of 0.161 ulps on an FX8150 but 0.154 ulps on a godbolt.org server, even though -ffast-math is not enabled (GCC).
If this is caused by hardware, and accuracy differs per machine, what does a C++ compiler do to maintain floating-point accuracy across different machines?
Is there only a minimum accuracy requirement in the language specification, so that any CPU vendor is free to boost precision as high as they want?
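For context, here is a minimal sketch (not taken from my benchmark; the inputs are chosen only to expose the effect) of how a single fused multiply-add rounds differently from a separate multiply followed by an add. I suspect this is the kind of difference that shows up between machines:

#include <cmath>
#include <cstdio>

int main()
{
    // 1 + 2^-12 and 1 + 2^-11 are exactly representable in float
    float a = 1.0f + 1.0f/4096.0f;
    float c = -(1.0f + 1.0f/2048.0f);

    // unfused: a*a rounds to 1 + 2^-11 first, then the add cancels it to 0
    // (build with -ffp-contract=off so the compiler cannot fuse this expression itself)
    float unfused = a*a + c;

    // fused: the exact product keeps its low bit, so 2^-24 survives the addition
    float fused = std::fma(a, a, c);

    std::printf("unfused = %g, fused = %g\n", unfused, fused);
    return 0;
}

With these values I expect the unfused expression to print 0 and the fused one about 5.96e-08 (2^-24), because the fused form rounds only once, after the addition.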
Minimal reproducible sample:
#include<iostream>

// only optimized for [-1,1] input range
template<typename Type, int Simd>
inline
void cosFast(Type * const __restrict__ data, Type * const __restrict__ result) noexcept
{
    alignas(64)
    Type xSqr[Simd];

    for(int i=0;i<Simd;i++)
    {
        xSqr[i] = data[i]*data[i];
    }

    for(int i=0;i<Simd;i++)
    {
        result[i] = Type(2.425144155360214881511638e-05);
    }

    for(int i=0;i<Simd;i++)
    {
        result[i] = result[i]*xSqr[i] + Type(-0.001388599083010255696990498);
    }

    for(int i=0;i<Simd;i++)
    {
        result[i] = result[i]*xSqr[i] + Type(0.04166657759826541962411284);
    }

    for(int i=0;i<Simd;i++)
    {
        result[i] = result[i]*xSqr[i] + Type(-0.4999999436679569697616898);
    }

    for(int i=0;i<Simd;i++)
    {
        result[i] = result[i]*xSqr[i] + Type(0.9999999821855363180134191);
    }
}
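
// Not part of the original benchmark: a hypothetical scalar variant that forces
// the fused operations with std::fma, so the rounding no longer depends on
// whether the compiler chooses to contract result*xSqr + coeff on its own
// (i.e. on -ffp-contract / -mfma).
#include<cmath>
template<typename Type>
inline Type cosFastScalar(Type x) noexcept
{
    const Type xSqr = x*x;
    Type r = Type(2.425144155360214881511638e-05);
    r = std::fma(r, xSqr, Type(-0.001388599083010255696990498));
    r = std::fma(r, xSqr, Type(0.04166657759826541962411284));
    r = std::fma(r, xSqr, Type(-0.4999999436679569697616898));
    r = std::fma(r, xSqr, Type(0.9999999821855363180134191));
    return r;
}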
#include<cstring>
#include<cstdint> // for uint32_t

template<typename T>
uint32_t GetUlpDifference(T a, T b)
{
    // reinterpret the bits of both values (assumes a 32-bit floating-point type)
    uint32_t aBitValue;
    uint32_t bBitValue;
    std::memcpy(&aBitValue,&a,sizeof(T));
    std::memcpy(&bBitValue,&b,sizeof(T));

    return (aBitValue > bBitValue) ?
           (aBitValue - bBitValue) :
           (bBitValue - aBitValue);
}
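
// Not part of the original question: the raw bit difference above is only a valid
// ULP distance when a and b have the same sign (which holds here, since cos is
// positive on [-1,1]). A hypothetical sign-aware variant, remapping the bit
// patterns onto a monotonic integer scale first:
template<typename T>
uint32_t GetUlpDistanceSigned(T a, T b)
{
    static_assert(sizeof(T) == sizeof(uint32_t), "sketch assumes a 32-bit float type");
    uint32_t ua, ub;
    std::memcpy(&ua,&a,sizeof(T));
    std::memcpy(&ub,&b,sizeof(T));
    // negative floats are sign-magnitude; fold them below the positives
    const uint32_t ma = (ua & 0x80000000u) ? (0x80000000u - (ua & 0x7fffffffu)) : (ua + 0x80000000u);
    const uint32_t mb = (ub & 0x80000000u) ? (0x80000000u - (ub & 0x7fffffffu)) : (ub + 0x80000000u);
    return (ma > mb) ? (ma - mb) : (mb - ma);
}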
#include<vector>

template<typename Type>
float computeULP(std::vector<Type> real, std::vector<Type> approximation)
{
    int ctr = 0;
    Type diffSum = 0;

    for(auto r:real)
    {
        Type diff = GetUlpDifference(r,approximation[ctr++]);
        diffSum += diff;
    }

    return diffSum/ctr;
}
template<typename Type>
float computeMaxULP(std::vector<Type> real, std::vector<Type> approximation)
{
    int ctr = 0;
    Type mx = 0;
    int index = -1;
    Type rr = 0;
    Type aa = 0;

    for(auto r:real)
    {
        Type diff = GetUlpDifference(r,approximation[ctr++]);
        if(mx<diff)
        {
            mx = diff;
            rr = r;
            aa = approximation[ctr-1];
            index = ctr-1;
        }
    }

    std::cout<<"("<<index<<":"<<rr<<"<-->"<<aa<<")";
    return mx;
}
#include<cmath>

void test()
{
    constexpr int n = 8192*64;
    std::vector<float> a(n),b(n),c(n);

    for(int i=0;i<n;i++)
        a[i] = (i-(n/2))/(float)(n/2);

    // approximation
    for(int i=0;i<n;i+=16)
        cosFast<float,16>(a.data()+i,b.data()+i);

    // exact
    for(int i=0;i<n;i++)
        c[i] = std::cos(a[i]);

    std::cout<<"avg. ulps: "<<computeULP(b,c)<<std::endl;
    std::cout<<"max. ulps: "<<computeMaxULP(b,c)<<std::endl;
}
int main()
{
    test();
    return 0;
}
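One thing I am not sure about: std::cos itself is not required to be correctly rounded and can differ between libm builds, so part of the measured ULP difference may come from the reference rather than from the FMA codegen. A sketch of what I mean (it would replace the "exact" loop in test() above, names unchanged), computing the reference in double and rounding it once to float:

    // double-precision reference, rounded once to float
    for(int i=0;i<n;i++)
        c[i] = static_cast<float>(std::cos(static_cast<double>(a[i])));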
Proof that it uses FMA:
https://godbolt.org/z/Y4qYMoxcn
.L23:
vmovups ymm3, YMMWORD PTR [r12+rax]
vmovups ymm2, YMMWORD PTR [r12+32+rax]
vmulps ymm3, ymm3, ymm3
vmulps ymm2, ymm2, ymm2
vmovaps ymm1, ymm3
vmovaps ymm0, ymm2
vfmadd132ps ymm1, ymm7, ymm8
vfmadd132ps ymm0, ymm7, ymm8
vfmadd132ps ymm1, ymm6, ymm3
vfmadd132ps ymm0, ymm6, ymm2
vfmadd132ps ymm1, ymm5, ymm3
vfmadd132ps ymm0, ymm5, ymm2
vfmadd132ps ymm1, ymm4, ymm3
vfmadd132ps ymm0, ymm4, ymm2
vmovups YMMWORD PTR [r13+0+rax], ymm1
vmovups YMMWORD PTR [r13+32+rax], ymm0
add rax, 64
cmp rax, 2097152
jne .L23
This godbolt instance (I don't know whether it was a Xeon or an EPYC) further improved the result to 0.152 ulps average.
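In case it matters for reproducing the numbers, here is a small sketch to check whether a given build actually targets hardware FMA. My assumptions: GCC/Clang predefine __FMA__ when the FMA instruction set is enabled for the target, and <cmath> may define FP_FAST_FMAF when a fused single-precision fma is fast:

#include <cmath>
#include <iostream>

int main()
{
#ifdef __FMA__
    std::cout << "built with FMA target support (-mfma or an -march implying it)\n";
#else
    std::cout << "no FMA target support at compile time\n";
#endif
#ifdef FP_FAST_FMAF
    std::cout << "FP_FAST_FMAF defined: std::fma on float is expected to be a single instruction\n";
#endif
    return 0;
}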