
I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. It requires computation of 2**64 / divisor, as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, based on the requirements of the following fixed-point iterations. This means this computation needs to provide an underestimate. I currently have the following code, which works well, based on extensive testing:

#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()

uint64_t divisor, recip;
float r, s, t;

t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor 

While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:

#include <math.h>      // import fmaf()
#include <xmmintrin.h> // import _mm_set_ss(), _mm_rcp_ss(), _mm_store_ss()
/* Compute 1.0f/a almost correctly rounded, using a Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);
    t = _mm_rcp_ss (t);
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}

Implementations of nextafterf() are typically not performance optimized. On platforms that provide means to quickly re-interpret an IEEE 754 binary32 as an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine the use of nextafterf() and the scaling as follows:

s = int_as_float (float_as_int (r) + 0x1fffffff);
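On platforms without dedicated intrinsics, the reinterpretation helpers used above can be written portably with memcpy(), which compilers typically reduce to a single register move; this is a sketch using the names from the code above:

```c
#include <stdint.h>
#include <string.h>

/* Portable bit-level reinterpretation; the memcpy() is typically
   compiled to a plain register move, not a call. */
static inline int32_t float_as_int (float a)
{
    int32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

static inline float int_as_float (int32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}
```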

Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types, making this a performance bottleneck.

t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

A portable, but slow, implementation of uint64_to_float_ru uses dynamic changes to FPU rounding mode:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

float uint64_to_float_ru (uint64_t a)
{
    float res;
    int curr_mode = fegetround ();
    fesetround (FE_UPWARD);
    res = (float)a;
    fesetround (curr_mode);
    return res;
}

I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes this computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and follow those up with a 32-bit fixed-point Newton-Raphson step.
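For reference, one way of doing the rounding on the integer side might be sketched as follows (uint64_to_float_ru_alt is a name chosen here for illustration, and __builtin_clzll assumes GCC or Clang); the clz/shift/compare work it performs is exactly the kind of overhead discussed above:

```c
#include <stdint.h>
#include <math.h>

/* Round-up conversion done on the integer side: keep the top 24 bits of
   the argument, round up if any discarded bit is set, then scale the
   result with ldexpf(). */
static float uint64_to_float_ru_alt (uint64_t a)
{
    if (a == 0) return 0.0f;
    int lz = __builtin_clzll (a);     /* GCC/Clang builtin */
    int shift = 64 - 24 - lz;         /* bits below the 24-bit window */
    if (shift <= 0) return (float)a;  /* fits in 24 bits: conversion exact */
    uint64_t hi = a >> shift;
    hi += ((a & ((1ULL << shift) - 1)) != 0); /* round up */
    return ldexpf ((float)hi, shift); /* exact: hi <= 2**24 */
}
```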

Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-) the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.

uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
    uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
    float r, s, t;

    /* compute initial approximation for reciprocal; must be underestimate! */
    t = uint64_to_float_ru (divisor);
    r = 1.0f / t;
    s = 0x1.0p64f * nextafterf (r, 0.0f);
    recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

    /* perform Halley iteration with cubic convergence to refine reciprocal */
    temp = neg_divisor * recip;
    temp = umul64hi (temp, temp) + temp;
    recip = umul64hi (recip, temp) + recip;

    /* compute preliminary quotient and remainder */
    quot = umul64hi (dividend, recip); 
    rem = dividend - divisor * quot;

    /* adjust quotient if too small; quotient off by 2 at most */
    if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;

    /* handle division by zero */
    if (divisor == 0ULL) quot = ~0ULL;

    return quot;
}

umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:

inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
    uint64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"  // rax = a
        "mulq  %2;\n\t"         // rdx:rax = a * b
        "movq  %%rdx, %0;\n\t"  // res = (a * b)<127:64>
        : "=rm" (res)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    return res;
}
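As a portable alternative (not from the original post), compilers that support the __uint128_t extension (GCC, Clang) can express the high multiply directly and typically emit a single widening multiply instruction; umul64hi_portable is a name chosen here for illustration:

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product of two 64-bit operands. */
static inline uint64_t umul64hi_portable (uint64_t a, uint64_t b)
{
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}
```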
phuclv
njuffa
  • Given that floating point reciprocal is an obvious and common operation, shouldn't your compiler be smart enough to emit optimized code for it, assuming your ISA supports it and you have told the compiler so? – John Zwinck Apr 26 '16 at 01:41
  • @JohnZwinck Maybe :-) Usually it involves fiddling with compiler switches, which then negatively affect other code in undesired ways. Intrinsics are fine, they can often be abstracted away into a set of "generic intrinsics" that map closely to platform-specific ones (see the SIMD source code for GROMACS as a worked example). In any event, the floating-point reciprocal isn't really my problem here, the conversions are killing my approach (except on GPUs). – njuffa Apr 26 '16 at 01:57
  • Did you benchmark? How? Which target details? Which toolchain? What was the outcome? Why do you think "fiddling with compiler switches" is not required for your code? If you want full control of the generated code, you eventually have to use Assembler. – too honest for this site Apr 26 '16 at 02:04
  • @Olaf: This is exploratory work ideally applicable to *multiple* platforms. Eventually may go down to assembly language level, but premature now (focus on algorithm). At present using Intel compiler on an x86_64 platform to build the code (`/O3, /QxHOST`). One look at generated assembly code was enough to convince me that this initial approximation lacks efficiency (the NR iterations are fine). Way too many instructions, many related to splitting `uint64_t` for the conversions, it seems. On an NVIDIA GPU, using intrinsics, this approach can map to about five instructions or so and is useable – njuffa Apr 26 '16 at 02:33
  • "must be strictly smaller than mathematical result," seems like it should be "must be strictly smaller than _or equal to_ the mathematical result," – chux - Reinstate Monica Apr 26 '16 at 16:38
  • @chux Best I can tell, it *cannot* be equal or the iteration will fail to produce the correct result in some cases. This is at least true for the dual Newton-Raphson iterations I used, I have not tried it yet with the single Halley iteration (cubic convergence) I just switched to five minutes ago. I will double check whether "smaller or equal" is sufficient and adjust the question if necessary. – njuffa Apr 26 '16 at 17:01
  • @chux You are right, it should be *smaller than or equal to* the mathematical result, which also makes intuitive sense. Thanks for bringing that to my attention. I was misled by comparing to a number that was close to, but not always identical to, the mathematical result. – njuffa Apr 26 '16 at 17:38
  • Doesn't x86 have instructions for this? – user3528438 Apr 26 '16 at 18:20
  • Likewise, ARM has instructions to calculate both the initial guess and the iteration: http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf page A2-85. – user3528438 Apr 26 '16 at 18:25
  • 1
    Also similar question here: http://stackoverflow.com/questions/35063224/trick-to-divide-a-constant-power-of-two-by-an-integer/35096198#35096198 – user3528438 Apr 26 '16 at 18:26
  • @user3528438 Yes, x86 has a 64-bit division instruction, but it is not terribly fast. I am currently trying to find out whether integer division based on fast integer multiplies can be made competitive both with built-in instructions and alternative approaches of computing integer division, with the ultimate goal of fully pipelined, and vectorizable code. At minimum such an approach requires a very fast initial approximation, which is why I would like to use the reciprocal approximation capability of modern processors. I am familiar with the LUT-based approach as used in the related question. – njuffa Apr 26 '16 at 18:31
  • Are you able to put bounds on the range of numbers you are working with? – LogicG8 May 20 '16 at 00:09
  • @LogicG8 I am not sure where you are heading with this question. Since this is in the context of 64-bit unsigned integer division, the divisor passed to the reciprocal computation is in [1, 2**64). That is not a problem for the floating-point reciprocal approximation instructions implemented by modern processor. It does seem to preclude the use of well-known tricks for conversions between integer and floating-point space, in particular the "magic number addition". – njuffa May 20 '16 at 02:05
  • @njuffa That is where I was headed. – LogicG8 May 20 '16 at 13:57
  • not exactly reciprocal, but you can calculate the initial approximation as `(-n)/n + 1` [How to compute 2⁶⁴/n in C?](https://stackoverflow.com/q/55565537/995714) – phuclv May 31 '19 at 15:30
  • The `umul64hi` inline asm could avoid both `mov` instructions by using `"+a"(a), "=d"(res) : "rm"(b)` constraints to tell the compiler about the fixed registers. Possibly with more work even teach it that `a` and `b` are commutative, maybe with manually separating `+a` into an `a` input and a separate dummy output. – Peter Cordes Aug 15 '21 at 10:13
  • @PeterCordes As I recall, I experimented with all these constraint variations at the time and the code broke. Possibly a compiler bug, but I could not change the compiler on the machine I was using. – njuffa Aug 15 '21 at 13:45

1 Answer


This solution combines two ideas:

  • You can convert to floating point by simply reinterpreting the bits as floating point and subtracting a constant, so long as the number is within a particular range. So add a constant, reinterpret, then subtract that constant. This gives a truncated result (which is therefore always less than or equal to the desired value).
  • You can approximate the reciprocal by negating both the exponent and the mantissa. This can be achieved by interpreting the bits as an integer.

Option 1 here only works in a certain range, so we check the range and adjust the constants used. This works in 64 bits because the desired float only has 23 bits of precision.

The result in this code will be double, but converting to float is trivial, and can be done on the bits or directly, depending on hardware.

After this you'd want to do the Newton-Raphson iteration(s).

Much of this code is simply setting up the magic numbers.

double u64tod_inv( uint64_t u64 ) {
  __asm__( "#annot0" );
  union {
    double f;
    struct {
      unsigned long m:52; // careful here with endianness
      unsigned long x:11;
      unsigned long s:1;
    } u64;
    uint64_t u64i;
  } z,
    magic0 = { .u64 = { 0, (1<<10)-1 + 52, 0 } },
    magic1 = { .u64 = { 0, (1<<10)-1 + (52+12), 0 } },
    magic2 = { .u64 = { 0, 2046, 0 } };

  __asm__( "#annot1" );
  if( u64 < (1UL << 52UL ) ) {
    z.u64i = u64 + magic0.u64i;
    z.f   -= magic0.f;
  } else {
    z.u64i = ( u64 >> 12 ) + magic1.u64i;
    z.f   -= magic1.f;
  }
  __asm__( "#annot2" );

  z.u64i = magic2.u64i - z.u64i;

  return z.f;
}

Compiling this on an Intel Core i7 produces a number of instructions (and a branch), but, of course, no multiplies or divides at all. If the casts between int and double are fast, this should run pretty quickly.

I suspect float (with only 23 bits of precision) will require more than 1 or 2 Newton-Raphson iterations to get the accuracy you want, but I haven't done the math...

tolkienfan
  • I don't see the use of a fast floating-point reciprocal. The approach here seems to fall into the category of "fixed-point polynomial approximation" (here: piecewise linear) that I already mentioned as an alternative in my question and possibly relates to [this question](http://stackoverflow.com/questions/32042673/optimized-low-accuracy-approximation-to-rootnx-n). The reason I asked about approach via fast floating-point reciprocal specifically is because it is provided by multiple architectures, yet I can't figure out how to make it practically useful other than on GPUs. – njuffa Oct 14 '16 at 22:25
  • You had mentioned issues with conversion between uint64 and floating point... this handles that. It does the approx reciprocal via the same method you linked to. Since those weren't what you were looking for, and you do know about existing approx reciprocal instructions, I'm not sure what you really want answered. – tolkienfan Oct 16 '16 at 02:22
  • I know about conversion by re-interpretation and use of a magic number (mentioned in comments), and I know how to form a fast reciprocal by integer manipulations. So I am not sure that there is anything here that I haven't tried already. Since I have some time now, I will take a closer look at your code and see how it might plug into the overall division sequence I showed above for full context for my question. If you are so inclined, you could also clarify this plug-in aspect. – njuffa Oct 16 '16 at 04:10
  • Best I can tell from my experiments, `u64tod_inv()` is a low accuracy replacement for `t = uint64_to_float_ru (divisor); r = 1.0f / t;` with relative error of 0.125, requiring three floating-point NR iterations to get a result accurate to single precision. It looks like this could be made to work (is tight underestimation guaranteed for initial `recip`?), but since it does not use fast hardware floating-point reciprocal capabilities (per the question title), this is not the answer I am seeking. – njuffa Oct 16 '16 at 05:37
  • You are correct - it's a low accuracy replacement for 1./t (except that it does the conversions too). Rereading, I see that you need the rounding in the opposite direction than I initially thought. This code does not round down, but this can be fixed with a multiply (there IS a strict range of relative error). It doesn't seem like you really need a strict underestimate though, do you? – tolkienfan Oct 17 '16 at 14:50
  • As established in the comment trail for the question, the fixed-point Halley iteration requires an initial estimate that is smaller than or equal to the mathematical result. – njuffa Oct 17 '16 at 16:22
  • The Halley iteration code above doesn't converge for me... but replacing "recip = umul64hi (recip, temp) + recip" with "recip = recip - umul64hi (recip, temp)" seems to fix the issue. – tolkienfan Oct 17 '16 at 20:16
  • If the Halley iteration code doesn't converge, that is a good indication that the initial value of `recip` does not satisfy the constraint that it must be a close underestimate of the true mathematical result. This will lead to overflow in the fixed-point iteration from which the code cannot recover. A typical case that breaks without the constraint on the initial approximation is the computation of reciprocal of powers-of-2. In addition to paper analysis, the code I posted was tested with many billions of test vectors, I'd be highly interested if you find a failure in the code *as I wrote it* – njuffa Oct 17 '16 at 20:26
  • I just noticed that your neg_divisor is an unsigned int - was that intended? It was tested as you wrote it with the exception that I used floating point. It never converged. With my modification it converges when the relative error is > 0 and < phi (the golden ratio). The given code is within those bounds. – tolkienfan Oct 17 '16 at 21:02
  • `neg_divisor` is a `uint64_t`, which is as intended and wrap-around behavior is properly defined for unsigned integer types. You *cannot* replace the fixed-point Halley iteration with a floating-point version verbatim, because the fixed-point computation uses a trick: The high-order bits of the temp results are not carried along, because as long as the initial `recip` is an underestimate, those high-order bits will be zero. Compare with the mathematical definition of the Halley iteration for the reciprocal and it will become clear. – njuffa Oct 17 '16 at 21:22
  • If you replace magic2 with "magic2 = { .u64 = { 3659174697238528UL, 2045, 0 } };" it will always underestimate... it's far from the closest bound - but I haven't calculated that yet. If you're interested, I will. – tolkienfan Oct 17 '16 at 22:26