It seems one has to build a custom emulator, like the sample below. Direct bitwise reinterpretation of an integer as a float (the intMagic functions) gives the right answer only for small integers, like 1 to 1023 here, or anything that fits in the mantissa bits, but it does avoid one conversion/cast latency. This is only a sample, not meant for exact work:
https://godbolt.org/z/x4c5dGndz
#include <cstdint> // <stdint.h> also works, but <cstdint> is preferred in C++
#include <cstring>
#include <iostream>
#ifdef _MSC_VER
#  include <intrin.h>
#else
#  include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline uint64_t readTSC() {
    // _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence(); // optionally block later instructions until rdtsc retires
    return tsc;
}
void intTest(uint32_t * const __restrict__ val1, uint32_t * const __restrict__ val2, uint32_t * const __restrict__ val3)
{
    for(int i=0;i<1024;i++)
    {
        val3[i] = val1[i]/val2[i]; // one scalar div per element
    }
}
void intEmulationTest(uint32_t * const __restrict__ val1, uint32_t * const __restrict__ val2, uint32_t * const __restrict__ val3)
{
    for(int i=0;i<1024;i++)
    {
        double v1 = val1[i];
        double v2 = val2[i];
        double v3 = v1/v2;
        // crude guard against the quotient landing just below an integer
        // before the truncating conversion
        double t = v3 - (uint32_t)v3;
        v3 += t<0.99?0.01:0.0;
        val3[i] = v3; // 42 instructions of code-bloat, yet 2x faster than one idiv >:c
    }
}
// Writes the bits of an integer directly into the mantissa of a float:
// a 23-bit integer n reinterpreted as a float is the denormal n * 2^-149,
// and the 2^-149 scale cancels in the division, so v1/v2 computes n/m.
// Up to 23 bits should be ok.
// Do not use -ffast-math: it flushes these denormals to zero!!
// FP rounding mode "truncation" (round-toward-zero) is required
// (see the sketch after the listing), and do not divide by zero.
// Warning: 10x speedup on the Zen 2 architecture.
void intMagicTest(uint32_t * const __restrict__ val1, uint32_t * const __restrict__ val2, uint32_t * const __restrict__ val3)
{
    for(int i=0;i<1024;i++)
    {
        float v1;
        float v2;
        std::memcpy(
            &v1,          // mantissa dest
            &val1[i],     // 23 least significant bits src
            sizeof(float) // copy all 4 bytes; assumes float is the same size as uint32_t
        );
        std::memcpy(&v2,&val2[i],sizeof(float));
        // I don't know how to turn the result of v1/v2 back into a denormal
        // (so just let the compiler convert it).
        // If that de-normalization were possible,
        // this could have no conversion latency at all.
        val3[i] = v1/v2; // vdivps when vectorized, with only 1 conversion
    }
}
// Writes the bits of a 32-bit integer (held in 64-bit storage)
// directly into the mantissa of a double
// (52 explicit mantissa bits, plenty for 32-bit values).
// Do not use -ffast-math: it flushes these denormals to zero!!
// FP rounding mode "truncation" is required, and do not divide by zero.
// Warning: 10x speedup on the Zen 2 architecture.
void intMagicTestDouble(uint64_t * const __restrict__ val1, uint64_t * const __restrict__ val2, uint64_t * const __restrict__ val3)
{
    for(int i=0;i<1024;i++)
    {
        double v1;
        double v2;
        std::memcpy(
            &v1,           // mantissa dest
            &val1[i],      // least significant bits src (values must fit in 52 bits)
            sizeof(double) // copy all 8 bytes; assumes double is the same size as uint64_t
        );
        std::memcpy(&v2,&val2[i],sizeof(double));
        // I don't know how to turn the result of v1/v2 back into a denormal
        // (so just let the compiler convert it).
        // If that de-normalization were possible,
        // this could have no conversion latency at all.
        val3[i] = v1/v2; // vdivpd when vectorized, with only 1 conversion
    }
}
// Writes the bits of a 32-bit integer (via a temporary 64-bit variable)
// directly into the mantissa of a double
// (52 explicit mantissa bits, plenty for 32-bit values).
// Do not use -ffast-math: it flushes these denormals to zero!!
// FP rounding mode "truncation" is required, and do not divide by zero.
// Warning: 10x speedup on the Zen 2 architecture.
void intMagicTestDoubleTmp(uint32_t * const __restrict__ val1, uint32_t * const __restrict__ val2, uint32_t * const __restrict__ val3)
{
    for(int i=0;i<1024;i++)
    {
        uint64_t tmp1 = val1[i];
        uint64_t tmp2 = val2[i];
        double v1;
        double v2;
        std::memcpy(
            &v1,           // mantissa dest
            &tmp1,         // least significant bits src (values must fit in 52 bits)
            sizeof(double) // copy all 8 bytes; assumes double is the same size as uint64_t
        );
        std::memcpy(&v2,&tmp2,sizeof(double));
        // I don't know how to turn the result of v1/v2 back into a denormal
        // (so just let the compiler convert it).
        // If that de-normalization were possible,
        // this could have no conversion latency at all.
        val3[i] = v1/v2; // vdivpd when vectorized, with only 1 conversion
    }
}
int main()
{
    uint32_t a[1024],b[1024],c[1024];
    for(int i=0;i<1024;i++)
    {
        a[i]=1+i*i; b[i]=1+i;
    }
    uint64_t a64[1024],b64[1024],c64[1024];
    for(int i=0;i<1024;i++)
    {
        a64[i]=1+i*i; b64[i]=1+i;
    }
    std::cout<<"emulation:"<<std::endl;
    auto t1 = readTSC();
    intEmulationTest(a,b,c);
    auto t2 = readTSC();
    for(int i=0;i<10;i++)
        std::cout<<c[i]<<" "<<std::endl;
    std::cout<<"magic:"<<std::endl;
    auto t3 = readTSC();
    intMagicTest(a,b,c);
    auto t4 = readTSC();
    for(int i=0;i<10;i++)
        std::cout<<c[i]<<" "<<std::endl;
    std::cout<<"int:"<<std::endl;
    auto t5 = readTSC();
    intTest(a,b,c);
    auto t6 = readTSC();
    for(int i=0;i<10;i++)
        std::cout<<c[i]<<" "<<std::endl;
    std::cout<<"magic double:"<<std::endl;
    auto t7 = readTSC();
    intMagicTestDouble(a64,b64,c64);
    auto t8 = readTSC();
    for(int i=0;i<10;i++)
        std::cout<<c64[i]<<" "<<std::endl;
    std::cout<<"magic double tmp:"<<std::endl;
    auto t9 = readTSC();
    intMagicTestDoubleTmp(a,b,c);
    auto t10 = readTSC();
    for(int i=0;i<10;i++)
        std::cout<<c[i]<<" "<<std::endl;
    std::cout<<"emulation: "<<t2-t1<<" cycles"<<std::endl;
    std::cout<<"magic: "<<t4-t3<<" cycles"<<std::endl;
    std::cout<<"int: "<<t6-t5<<" cycles"<<std::endl;
    std::cout<<"magic double: "<<t8-t7<<" cycles"<<std::endl;
    std::cout<<"magic double tmp: "<<t10-t9<<" cycles"<<std::endl;
    return 0;
}
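The magic paths only work if the rounding mode really is truncation and denormals survive (no FTZ/DAZ). Below is a minimal standalone sketch of how one might check both preconditions with std::fesetround; the demonstration values and the check are my addition, not part of the benchmark:
#include <cfenv>
#include <cstdint>
#include <cstring>
#include <iostream>
int main()
{
    // Round-toward-zero keeps a quotient computed just below an integer
    // (e.g. 6.999...) from rounding up to 7.0f before the truncating
    // float->int conversion.
    std::fesetround(FE_TOWARDZERO);
    // The identity behind the trick: a 23-bit integer n reinterpreted as a
    // float is the denormal n * 2^-149, and the scale cancels in a division.
    uint32_t n = 20, m = 3;
    float fn, fm;
    std::memcpy(&fn, &n, sizeof fn); // fn == 20 * 2^-149 (denormal)
    std::memcpy(&fm, &m, sizeof fm); // fm ==  3 * 2^-149 (denormal)
    std::cout << (uint32_t)(fn / fm) << std::endl; // prints 6, i.e. 20/3
    // Under -ffast-math (FTZ/DAZ) the operands read as zero and the trick breaks.
    std::cout << (fn == 0.0f ? "denormals flushed" : "denormals ok") << std::endl;
    return 0;
}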
Output of the benchmark program on godbolt.org:
emulation: 7784 cycles <-- should be ok for positive values only; may need more corner-case checking
magic: 1708 cycles <-- not performance-portable (denormals), only 23 bits
int: 16576 cycles
magic double: 11844 cycles <-- not performance-portable
magic double tmp: 5432 cycles <-- not performance-portable
To make up for the conversion overhead and the bit-level hacking, the SIMD hardware would need to be much wider, like 8192 or 16384 bits; maybe only then would it become appealing for compilers to vectorize unknown-integer / unknown-integer division through FP SIMD (roughly 100 instructions to check every corner case, but running across 256 pipelines for perhaps a ~2.5x speedup).
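For reference, a hand-vectorized AVX2 version of the 23-bit trick would look roughly like the sketch below (same assumptions as intMagicTest: no FTZ/DAZ, operands and quotients within float precision; the function is hypothetical, not part of the benchmark above):
#include <immintrin.h>
#include <cstdint>
// Hypothetical hand-vectorized intMagicTest (AVX2). Same caveats as the
// scalar version: denormals must not be flushed (no -ffast-math), and the
// division should run in round-toward-zero mode so a quotient just below
// an integer can't round up past it.
void intMagicTestAVX2(const uint32_t* __restrict__ a,
                      const uint32_t* __restrict__ b,
                      uint32_t* __restrict__ c)
{
    for (int i = 0; i < 1024; i += 8)
    {
        // Reinterpret 8 integers as denormal floats: n becomes n * 2^-149.
        __m256 v1 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i*)&a[i]));
        __m256 v2 = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i*)&b[i]));
        // The 2^-149 scales cancel: q holds a[i]/b[i] as a normal float.
        __m256 q = _mm256_div_ps(v1, v2);
        // vcvttps2dq truncates toward zero regardless of the MXCSR rounding mode.
        _mm256_storeu_si256((__m256i*)&c[i], _mm256_cvttps_epi32(q));
    }
}
Note that on the Intel CPUs below this would likely still lose badly: denormal operands tend to take a slow assist path on those cores, which matches the "not performance-portable" caveat above.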
GPU hardware has 32 pipelines per warp/wavefront and up to 192 pipelines per core, so maybe it is usable there, but it does not look like much of a gain for x86 CPUs even with AVX512 (at least for general-purpose use that requires full 32-bit precision). For very low-precision integer math, one can simply use floats everywhere instead (assuming the corner cases are not a problem).
CPU Type: AMD EPYC 7R32 (GCC v11)
emulation: 8260 cycles
magic: 1904 cycles
int: 15708 cycles (this was compiled with uint64_t)
magic double: 12544 cycles
magic double tmp: 6188 cycles
CPU Type: AMD FX(tm)-8150 Eight-Core Processor (GCC v10)
emulation: 20687 cycles
magic: 67583 cycles
int: 32914 cycles
int: 31135 cycles (this was compiled with uint64_t)
magic double: 615307 cycles
magic double tmp: 141889 cycles
CPU Type: Intel(R) Xeon(R) E-2286G CPU @ 4.00GHz
emulation: 9964 cycles
magic: 138052 cycles
int: 6477 cycles
int: 19016 cycles (this was compiled with uint64_t)
magic double: 141443 cycles
magic double tmp: 137180 cycles
CPU Type: Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz
emulation: 18282 cycles
magic: 210486 cycles
int: 14436 cycles
int: 33604 cycles (this was compiled with uint64_t)
magic double: 225920 cycles
magic double tmp: 217520 cycles
CPU Type: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
emulation: 39483 cycles
magic: 153666 cycles
int: 33746 cycles (this was compiled with uint64_t)
magic double: 158076 cycles
magic double tmp: 159813 cycles
CPU Type: AMD Opteron(tm) Processor 4332 HE
emulation: 18633 cycles
magic: 114682 cycles
int: 16280 cycles
int: 31070 cycles (this was compiled with uint64_t)
magic double: 504295 cycles
magic double tmp: 104919 cycles
CPU Type: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
emulation: 3448 cycles <-- AVX512
magic: 13296 cycles
int: 7676 cycles
int: 84110 cycles (this was compiled with uint64_t)
magic double: 178162 cycles
magic double tmp: 27662 cycles