
A recent question, whether compilers are allowed to replace floating-point division with floating-point multiplication, inspired me to ask this question.

Under the stringent requirement that the results after code transformation be bit-wise identical to the actual division operation, it is trivial to see that for binary IEEE-754 arithmetic this is possible for divisors that are a power of two: as long as the reciprocal of the divisor is representable, multiplying by that reciprocal delivers results identical to the division. For example, multiplication by 0.5 can replace division by 2.0.

One then wonders for what other divisors such replacements work, assuming we allow any short instruction sequence that replaces the division but runs significantly faster, while delivering bit-identical results. In particular, allow fused multiply-add operations in addition to plain multiplication. In comments I pointed to the following relevant paper:

Nicolas Brisebarre, Jean-Michel Muller, and Saurabh Kumar Raina. Accelerating correctly rounded floating-point division when the divisor is known in advance. IEEE Transactions on Computers, Vol. 53, No. 8, August 2004, pp. 1069-1072.

The technique advocated by the authors of the paper precomputes the reciprocal of the divisor y as a normalized head-tail pair zh:zl as follows: zh = 1 / y, zl = fma (-y, zh, 1) / y. The division q = x / y is later computed as q = fma (zh, x, zl * x). The paper derives various conditions that the divisor y must satisfy for this algorithm to work. As one readily observes, this algorithm has problems with infinities and zeroes when the signs of head and tail differ. More importantly, it will fail to deliver correct results for dividends x that are very small in magnitude, because computation of the quotient tail, zl * x, suffers from underflow.
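For concreteness, a minimal C sketch of this scheme (it simply mirrors the formulas above; it assumes binary32 data, with the divisor y known at compile time):

    /* precompute the head:tail reciprocal of the divisor y */
    float zh = 1.0f / y;                 /* head: rounded reciprocal               */
    float zl = fmaf (-y, zh, 1.0f) / y;  /* tail: correction based on the residual */

    /* at run time, replace q = x / y with one multiplication and one FMA */
    float q = fmaf (zh, x, zl * x);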

The paper also makes a passing reference to an alternative FMA-based division algorithm, pioneered by Peter Markstein when he was at IBM. The relevant reference is:

P. W. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research & Development, Vol. 34, No. 1, January 1990, pp. 111-119

In Markstein's algorithm, one first computes a reciprocal rc, from which an initial quotient q = x * rc is formed. Then, the remainder of the division is computed accurately with an FMA as r = fma (-y, q, x), and an improved, more accurate quotient is finally computed as q = fma (r, rc, q).

This algorithm also has issues for x that are zeroes or infinities (easily worked around with appropriate conditional execution), but exhaustive testing using IEEE-754 single-precision float data shows that it delivers the correct quotient across all possible dividends x for many divisors y, among them many small integers. This C code implements it:

/* precompute reciprocal */
rc = 1.0f / y;

/* compute quotient q=x/y */
q = x * rc;
if ((x != 0) && (!isinf(x))) {
    r = fmaf (-y, q, x);
    q = fmaf (r, rc, q);
}

On most processor architectures, this should translate into a branchless sequence of instructions, using either predication, conditional moves, or select-type instructions. To give a concrete example: For division by 3.0f, the nvcc compiler of CUDA 7.5 generates the following machine code for a Kepler-class GPU:

    LDG.E R5, [R2];                        // load x
    FSETP.NEU.AND P0, PT, |R5|, +INF , PT; // pred0 = fabsf(x) != INF
    FMUL32I R2, R5, 0.3333333432674408;    // q = x * (1.0f/3.0f)
    FSETP.NEU.AND P0, PT, R5, RZ, P0;      // pred0 = (x != 0.0f) && (fabsf(x) != INF)
    FMA R5, R2, -3, R5;                    // r = fmaf (q, -3.0f, x);
    MOV R4, R2;                            // q
@P0 FFMA R4, R5, c[0x2][0x0], R2;          // if (pred0) q = fmaf (r, (1.0f/3.0f), q)
    ST.E [R6], R4;                         // store q

For my experiments, I wrote the tiny C test program shown below that steps through integer divisors in increasing order and for each of them exhaustively tests the above code sequence against the proper division. It prints a list of the divisors that passed this exhaustive test. Partial output looks as follows:

PASS: 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 15, 16, 17, 19, 21, 23, 25, 27, 29, 31, 32, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 64, 65, 67, 69,

To incorporate the replacement algorithm into a compiler as an optimization, a whitelist of divisors to which the above code transformation can safely be applied is impractical. The output of the program so far (at a rate of about one result per minute) suggests that the fast code works correctly across all possible encodings of x for those divisors y that are odd integers or are powers of two. Anecdotal evidence, not a proof, of course.

What set of mathematical conditions can determine a-priori whether the transformation of division into the above code sequence is safe? Answers can assume that all the floating-point operations are performed in the default rounding mode of "round to nearest or even".

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int main (void)
{
    float r, q, x, y, rc;
    volatile union {
        float f;
        unsigned int i;
    } arg, res, ref;
    int err;

    y = 1.0f;
    printf ("PASS: ");
    while (1) {
        /* precompute reciprocal */
        rc = 1.0f / y;

        arg.i = 0x80000000;
        err = 0;
        do {
            /* do the division, fast */
            x = arg.f;
            q = x * rc;
            if ((x != 0) && (!isinf(x))) {
                r = fmaf (-y, q, x);
                q = fmaf (r, rc, q);
            }
            res.f = q;
            /* compute the reference, slowly */
            ref.f = x / y;

            if (res.i != ref.i) {
                err = 1;
                break;
            }
            arg.i--;
        } while (arg.i != 0x80000000);

        if (!err) printf ("%g, ", y);
        y += 1.0f;
    }
    return EXIT_SUCCESS;
}
njuffa
  • Not sure why the question has been tagged for closure as "too broad". I would appreciate if the downvoter could explain their reasoning. I am trying to determine when it is "safe" to replace floating-point division with a constant integer divisor with a *very specific* code sequence shown in the question. Anecdotal evidence from my test results seems to indicate it works for odd integers, and those that are powers of two. But to propose this as a general purpose optimization, there needs to be solid mathematical reasoning for which integers this is "safe"; I don't have the math skills for that – njuffa Feb 20 '16 at 20:21
  • I would expect an answer to this question to list a couple of conditions that must be imposed on the divisor, together with up to a page for justification or derivation, which I would not consider as "too long" for the SO format. The reason I did not ask this question on Mathematics Stackexchange is because floating-point questions hardly get any traction there, while there are a number of mathematicians on Stackoverflow and the question is most definitely related to programming, so IMHO appropriate for the [math] tag here. – njuffa Feb 20 '16 at 20:32
  • Does this suggest that you can divide by 3 but not by 6 or 12 with above algorithm? In which case, there's an obvious extension: separate the division by the power of 2, it's just one more multiplication. – aka.nice Feb 20 '16 at 22:48
  • @aka.nice Yes. That fact has puzzled me, and I had the same idea of splitting such divisions into two stages. I have not tried it yet, but I think it may not work since division by two is not always exact when the result is a denormal. – njuffa Feb 20 '16 at 22:57
  • @aka.nice OK, I tried it and things are pretty much as I thought, concatenation of divisions where one is a power of two can fail if the result is denormal. Example: `x = 0x1.7ffffep-124`. The computation `x/3/2` gives `0x1.000000p-126` but the true quotient is `x/6 = 0x1.fffffcp-127`. – njuffa Feb 20 '16 at 23:14
  • Of course, crappy denormals! Never mind, it was a case of short memory http://comments.gmane.org/gmane.comp.lang.smalltalk.vwnc/26468 – aka.nice Feb 20 '16 at 23:33
  • This seems more appropriate for the computer science stackexchange, non? – Claudiu Feb 21 '16 at 19:20
  • @Claudiu Based on general perusal of Computer Science Stackexchange, search for relevant tags, and checking selected Q&A threads related to floating-point arithmetic on that site, my expectation of a meaningful answer (or even useful comments) would be very low. Since cross-posting seems strongly discouraged in the SO/SE universe, I can't simply perform the relevant experiment to find out one way or the other. – njuffa Feb 21 '16 at 19:45
  • @Claudiu I don't think anyone expert in floating-point hangs out on the CS stack exchange, so not really, no. Whereas there are a number of regular contributors here (including njuffa himself) who are quite knowledgeable. – Stephen Canon Feb 23 '16 at 02:44
  • @njuffa: If you have an AVX2/FMA3-capable CPU, you can increase the speed eight-fold (per true core). On this particular Core i5-4200U laptop, I can check all positive normals and subnormals for eight different divisors in just under a minute (55s), using one core. It has two real cores, and I just checked: my program can check 16 divisors per 55 secs (or so) on this machine, or 3.5 seconds per divisor. Want the code? – Nominal Animal Feb 23 '16 at 06:52
  • I know, but unfortunately I have an older generation Intel CPU and am stuck with emulated FMA (Intel compiler) with my own FMA-emulation as a backup. No correctness issues, but it is fairly slow. I could also use my GPU (lots of single-precision FMA-based FLOPs) but it is occupied with Folding @ Home for another four days or so. – njuffa Feb 23 '16 at 06:58
  • You may have a look at [Efficient Floating Point Division for Digital Signal Processing Application](https://www.docdroid.net/aUjzXVA). – Royi Dec 30 '18 at 05:29
  • @Royi Thanks for the pointer. To be honest, at first glance I don't see anything particularly novel about the approach in the paper, but I will take a closer look. – njuffa Dec 30 '18 at 08:38
  • Very late to the party, but notice that paper assumes unbounded exponents so ignores underflow/overflow & specials. (blackboard M def in section 3) – MB Reynolds Mar 05 '19 at 13:16

4 Answers


This question asks for a way to identify the values of the constant Y that make it safe to transform x / Y into a cheaper computation using FMA for all possible values of x. Another approach is to use static analysis to determine an over-approximation of the values x can take, so that the generally unsound transformation can be applied in the knowledge that the values for which the transformed code differs from the original division do not happen.

Using representations of sets of floating-point values that are well adapted to the problems of floating-point computations, even a forwards analysis starting from the beginning of the function can produce useful information. For instance:

float f(float z) {
  float x = 1.0f + z;
  float r = x / Y;
  return r;
}

Assuming the default round-to-nearest mode(*), in the above function x can only be NaN (if the input is NaN), +0.0f, or a number of magnitude at least 2^-24, but not -0.0f or anything closer to zero than 2^-24. This justifies the transformation into one of the two forms shown in the question for many values of the constant Y.

(*) assumption without which many optimizations are impossible and that C compilers already make unless the program explicitly uses #pragma STDC FENV_ACCESS ON


A forwards static analysis that predicts the information for x above can be based on a representation of sets of floating-point values an expression can take as a tuple of:

  • a representation for the sets of possible NaN values (Since behaviors of NaN are underspecified, a choice is to use only a boolean, with true meaning some NaNs can be present, and false indicating no NaN is present.),
  • four boolean flags indicating respectively the presence of +inf, -inf, +0.0, -0.0,
  • an inclusive interval of negative finite floating-point values, and
  • an inclusive interval of positive finite floating-point values.

In order to follow this approach, all the floating-point operations that can occur in a C program must be understood by the static analyzer. To illustrate, the addition between sets of values U and V, to be used to handle + in the analyzed code, can be implemented as follows (a rough C sketch of the representation appears after this list):

  • If NaN is present in one of the operands, or if the operands can be infinities of opposite signs, NaN is present in the result.
  • If 0 cannot be a result of the addition of a value of U and a value of V, use standard interval arithmetic. The upper bound of the result is obtained for the round-to-nearest addition of the largest value in U and the largest value in V, so these bounds should be computed with round-to-nearest.
  • If 0 can be a result of the addition of a positive value of U and a negative value of V, then let M be the smallest positive value in U such that -M is present in V.
    • if succ(M) is present in U, then this pair of values contributes succ(M) - M to the positive values of the result.
    • if -succ(M) is present in V, then this pair of values contributes the negative value M - succ(M) to the negative values of the result.
    • if pred(M) is present in U, then this pair of values contributes the negative value pred(M) - M to the negative values of the result.
    • if -pred(M) is present in V, then this pair of values contributes the value M - pred(M) to the positive values of the result.
  • Do the same work if 0 can be the result of the addition of a negative value of U and a positive value of V.
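As a rough C sketch of such a value-set representation (the type and field names are mine and purely illustrative; only the coarse NaN/infinity rules of the addition are shown, while the interval work and the careful succ/pred analysis near zero from the list above are elided):

    #include <stdbool.h>

    /* one possible encoding of the abstract set of float values */
    typedef struct {
        bool  maybe_nan;              /* some NaN may be present                      */
        bool  has_pinf, has_ninf;     /* +inf / -inf possible                         */
        bool  has_pzero, has_nzero;   /* +0.0 / -0.0 possible                         */
        float neg_lo, neg_hi;         /* inclusive interval of negative finite values */
        float pos_lo, pos_hi;         /* inclusive interval of positive finite values */
    } fp_set;

    /* sketch of '+' on sets: sound but deliberately coarse over-approximation */
    static fp_set fp_set_add (fp_set u, fp_set v)
    {
        fp_set r = {0};
        r.maybe_nan = u.maybe_nan || v.maybe_nan
                   || (u.has_pinf && v.has_ninf) || (u.has_ninf && v.has_pinf);
        r.has_pinf  = u.has_pinf || v.has_pinf;
        r.has_ninf  = u.has_ninf || v.has_ninf;
        /* ... interval endpoints would be computed with round-to-nearest
           additions, plus the succ/pred refinement near zero ...          */
        return r;
    }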

Acknowledgement: the above borrows ideas from “Improving the Floating Point Addition and Subtraction Constraints”, Bruno Marre & Claude Michel


Example: compilation of the function f below:

float f(float z, float t) {
  float x = 1.0f + z;
  if (x + t == 0.0f) {
    float r = x / 6.0f;
    return r;
  }
  return 0.0f;
}

The approach in the question refuses to transform the division in function f into an alternate form, because 6 is not one of the values for which the division can be unconditionally transformed. Instead, what I am suggesting is to apply a simple value analysis starting from the beginning of the function which, in this case, determines that x is a finite float that is either +0.0f or at least 2^-24 in magnitude, and to use this information to apply Brisebarre et al.'s transformation, confident in the knowledge that x * C2 does not underflow.

To be explicit, I am suggesting to use an algorithm such as the one below to decide whether or not to transform the division into something simpler:

  1. Is Y one of the values that can be transformed using Brisebarre et al's method according to their algorithm?
  2. Do C1 and C2 from their method have the same sign, or is it possible to exclude the possibility that the dividend is infinite?
  3. Do C1 and C2 from their method have the same sign, or can x take only one of the two representations of 0? In the case where C1 and C2 have different signs and x can be only one representation of zero, remember to fiddle(**) with the signs of the FMA-based computation to make it produce the correct zero when x is zero.
  4. Can the magnitude of the dividend be guaranteed to be large enough to exclude the possibility that x * C2 underflows?

If the answer to the four questions is “yes”, then the division can be transformed into a multiplication and an FMA in the context of the function being compiled. The static analysis described above serves to answer questions 2., 3. and 4.

(**) “fiddling with the signs” means using -FMA(-C1, x, (-C2)*x) in place of FMA(C1, x, C2*x) when this is necessary to make the result come out correctly when x can only be one of the two signed zeroes
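As a concrete illustration of (**), a hedged sketch (C1 and C2 stand for the precomputed head and tail of 1/Y from Brisebarre et al.'s method; the helper names are mine):

    #include <math.h>

    /* plain form: FMA(C1, x, C2*x) */
    static float div_by_Y (float x, float C1, float C2)
    {
        return fmaf (C1, x, C2 * x);
    }

    /* sign-fiddled form: -FMA(-C1, x, (-C2)*x), for the case where C1 and C2
       have different signs and x can only be one of the two signed zeroes,
       so that a zero dividend produces the correctly signed zero quotient   */
    static float div_by_Y_signfix (float x, float C1, float C2)
    {
        return -fmaf (-C1, x, (-C2) * x);
    }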

Pascal Cuoq
  • I am having trouble understanding how the answer relates to the question and now fear I may have miscommunicated the practical aspect of it: When encountering a floating-point division `x / fpconst`, where `fpconst` is an integer and `x` can take any encoding in a `float`, how can one determine if the replacement code delivers identical result to the division, based on `fpconst`? A yes/no result. This may be contained as special case in the generic algorithm above, but I don't recognize where. I don't understand the meaning of "M is present in Y": Y doesn't seem to be an interval? – njuffa Feb 21 '16 at 14:58
  • From exhaustive testing on `float` (implemented as IEEE-754 `binary32`) I know that `x/3.0f` can be replaced with the FMA-based sequence, delivering bit-identical results to the division for all possible values of `x` (i.e, result is TRUE). For `x/6.0f`, this is not possible, as the replacement doesn't return correct results when `x` is very small in magnitude (i.e. result is FALSE). *How would one derive these same results based on the procedure in the answer?* Would the procedure be faster than exhaustive test (about a minute per result for `float`)? – njuffa Feb 21 '16 at 15:09
  • @njuffa Yes, this answer does not provide the sufficient conditions on the constant `Y` to replace `x / Y` by an alternate form, for instance in the context of a compiler. This answer points out that instead, *in the context of a compiler*, it may be simpler and more effective to compute information about the values of `x` which is there for the taking in order to make it more frequent and simpler to determine that the transformation is correct. I can delete the answer if you consider it is too far off, but I posted it because I thought it addressed the same original problem: compiling `x / Y` – Pascal Cuoq Feb 21 '16 at 15:10
  • I am not suggesting you delete the answer. Just because I am personally having trouble understanding it doesn't mean other people cannot understand it. I get the reverse view point of your answer: For a given divisor `fpconst`, determine set of floating-point values `x` for which FMA-based code delivers identical result to the division. I can see how approach from that direction can be advantageous if range information on `x` already exists. From talks with compiler folks I know that frequently, for floating-point, there is no range information, `x` can be any `float` encoding. – njuffa Feb 21 '16 at 15:29
  • @njuffa Exactly, this is why the part of my answer that I initially developed most is how to implement a value analysis that usefully answers the questions that arise about the values taken by the dividend when trying to reason about the opportunity to simplify the division. – Pascal Cuoq Feb 21 '16 at 15:37
  • @njuffa I am hoping that the floating-point value analysis outlined above will be implemented about three months from now. I will drop you a line when it is. – Pascal Cuoq Feb 21 '16 at 15:48

Let me restart for the third time. We are trying to accelerate

    q = x / y

where y is an integer constant, and q, x, and y are all IEEE 754-2008 binary32 floating-point values. Below, fmaf(a,b,c) indicates a fused multiply-add a * b + c using binary32 values.

The naive algorithm is via a precalculated reciprocal,

    C = 1.0f / y

so that at runtime a (much faster) multiplication suffices:

    q = x * C

The Brisebarre-Muller-Raina acceleration uses two precalculated constants,

    zh = 1.0f / y
    zl = -fmaf(zh, y, -1.0f) / y

so that at runtime, one multiplication and one fused multiply-add suffices:

    q = fmaf(x, zh, x * zl)

The Markstein algorithm combines the naive approach with two fused multiply-adds that yield the correct result if the naive approach's result is within 1 unit in the last place, by precalculating

    C1 = 1.0f / y
    C2 = -y

so that the division can be approximated using

    t1 = x * C1
    t2 = fmaf(C2, t1, x)
    q  = fmaf(C1, t2, t1)

The naive approach works for all powers of two y, but otherwise it is pretty bad. For example, for divisors 7, 14, 15, 28, and 30, it yields an incorrect result for more than half of all possible x.

The Brisebarre-Muller-Raina approach similarly fails for almost all non-power of two y, but much fewer x yield the incorrect result (less than half a percent of all possible x, varies depending on y).

The Brisebarre-Muller-Raina article shows that the maximum error in the naive approach is ±1.5 ULPs.

The Markstein approach yields correct results for powers of two y, and also for odd integer y. (I have not found a failing odd integer divisor for the Markstein approach.)


For the Markstein approach, I have analysed divisors 1 - 19700 (raw data here).

Plotting the number of failure cases (divisor on the horizontal axis, the number of values of x for which the Markstein approach fails for that divisor on the vertical axis), we can see a simple pattern occur:

[Plot: number of Markstein failure cases per divisor (source: nominal-animal.net)]

Note that these plots have both horizontal and vertical axes logarithmic. There are no dots for odd divisors, as the approach yields correct results for all odd divisors I've tested.

If we change the x axis to the bit reverse (binary digits in reverse order, i.e. 0b11101101 → 0b10110111, data) of the divisors, we have a very clear pattern:

[Plot: Markstein failure cases vs. bit-reversed divisor (source: nominal-animal.net)]

If we draw a straight line through the center of the point sets, we get curve 4194304/x. (Remember, the plot considers only half the possible floats, so when considering all possible floats, double it.) 8388608/x and 2097152/x bracket the entire error pattern completely.

Thus, if we use rev(y) to compute the bit reverse of divisor y, then 8388608/rev(y) is a good first order approximation of the number of cases (out of all possible float) where the Markstein approach yields an incorrect result for an even, non-power-of-two divisor y. (Or, 16777216/rev(y) for the upper limit.)
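For reference, a small sketch of the bit reversal used above (the helper is mine; it reverses the significant binary digits of the integer divisor, matching the 0b11101101 → 0b10110111 example):

    /* reverse the significant binary digits of a positive integer,
       e.g. 0b11101101 (237) -> 0b10110111 (183)                    */
    static unsigned int bit_reverse (unsigned int v)
    {
        unsigned int r = 0;
        while (v) {
            r = (r << 1) | (v & 1u);
            v >>= 1;
        }
        return r;
    }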

Added 2016-02-28: I found an approximation for the number of error cases using the Markstein approach, given any integer (binary32) divisor. Here it is as pseudocode:

function markstein_failure_estimate(divisor):
    if (divisor is zero)
        return no estimate
    if (divisor is not an integer)
        return no estimate

    if (divisor is negative)
        negate divisor

    # Consider, for avoiding underflow cases,
    if (divisor is very large, say 1e+30 or larger)
        return no estimate - do as division

    while (divisor > 16777216)
        divisor = divisor / 2

    if (divisor is a power of two)
        return 0

    if (divisor is odd)
        return 0

    while (divisor is not odd)
        divisor = divisor / 2

    # Use return (1 + 8388608 / divisor) / 2
    # if only nonnegative finite float dividends are counted!
    return 1 + 8388608 / divisor

This yields a correct error estimate to within ±1 on the Markstein failure cases I have tested (but I have not yet adequately tested divisors larger than 8388608). The final division should be such that it reports no false zeroes, but I cannot guarantee it (yet). It does not take into account very large divisors (say 0x1p100, or 1e+30, and larger in magnitude) which have underflow issues -- I would definitely exclude such divisors from acceleration anyway.

In preliminary testing, the estimate seems uncannily accurate. I did not draw a plot comparing the estimates and the actual errors for divisors 1 to 20000, because the points all coincide exactly in the plots. (Within this range, the estimate is exact, or one too large.) Essentially, the estimates reproduce the first plot in this answer exactly.
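For convenience, a C sketch of the estimator (a straightforward translation of the pseudocode above; the function name and the use of double for the intermediate arithmetic are my own choices):

    #include <math.h>

    /* Estimated number of dividends x for which the Markstein sequence fails,
       for an integral binary32 divisor; returns -1.0 when no estimate is made. */
    static double markstein_failure_estimate (float divisor)
    {
        double d = fabs ((double)divisor);

        if (d == 0.0 || d != floor (d))
            return -1.0;               /* zero or non-integer: no estimate */
        if (d >= 1e+30)
            return -1.0;               /* very large: keep the division    */

        while (d > 16777216.0)         /* fold large divisors down         */
            d *= 0.5;

        double m = d;
        while (fmod (m, 2.0) == 0.0)   /* strip factors of two             */
            m *= 0.5;

        if (m == 1.0)                  /* divisor was a power of two       */
            return 0.0;
        if (m == d)                    /* divisor was already odd          */
            return 0.0;

        return 1.0 + 8388608.0 / m;
    }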


The pattern of failures for the Markstein approach is regular, and very interesting. The approach works for all power of two divisors, and all odd integer divisors.

For divisors greater than 16777216, I consistently see the same errors as for the value obtained by dividing the divisor by the smallest power of two that brings it below 16777216. For example, 0x1.3cdfa4p+23 and 0x1.3cdfa4p+41, 0x1.d8874p+23 and 0x1.d8874p+32, 0x1.cf84f8p+23 and 0x1.cf84f8p+34, 0x1.e4a7fp+23 and 0x1.e4a7fp+37. (Within each pair, the mantissa is the same, and only the power of two varies.)

Assuming my test bench is not in error, this means that the Markstein approach also works for divisors larger than 16777216 in magnitude (but smaller than, say, 1e+30), provided that, when the divisor is divided by the smallest power of two that yields a quotient less than 16777216 in magnitude, that quotient is odd.

Nominal Animal
  • I may be too tired right now but I am having trouble figuring out what this means: "Larger than 16777216, such that when divided by the smallest power of two where the quotient is less than 16777216, the quotient is odd". Could you describe that mathematically? BTW, I spent two days looking at divisors over 2**24 but couldn't work out a pattern which ones work. Please note that what you refer to as the "Brisebarre-Muller-Raina" algorithm above is "Algorithm 1 (division with one multiplication and two fused-macs)" from their paper and attributed to *Markstein* by them (jibes with references) – njuffa Feb 27 '16 at 03:15
  • Is this a counter example to your third rule? Divisor is`y`: For `y=33554334 y/2**n=16777167 (y/2**n)&1=1` Markstein FMA-based division fails to deliver correct result `y=0x1.ffff9ep+24 arg=0x1.1f589ap-101 (0d0fac4d) res=0x1.1f58d0p-126 (008fac68) ref=0x1.1f58d2p-126 (008fac69)` – njuffa Feb 27 '16 at 03:55
  • On my sm_50 GPU, with CUDA 7.5, I get: division = `0x1.1f589ap-101 / 0x1.ffff9ep+24 = 0x1.1f58d2p-126. Markstein: residual=-0x1.ffff9cp-126 final_quot=0x1.1f58d2p-126`. The Markstein sequence works for this divisor, as the results match. I notice belatedly that I unintentionally left off the `/fp:strict` flag for the Intel compiler, that is likely the cause for the earlier mismatch on the CPU. Sorry for the confusion, will investigate further. – njuffa Feb 27 '16 at 06:31
  • `/fp:strict` didn't help, the issue seems to be incorrect emulation of `fmaf()`. Bummer. Never encountered that before, I could have sworn Intel's emulation is rock solid. Apparently not. My own `fmaf()` emulation makes this test vector pass, but is too slow for exhaustive testing. No wonder I couldn't find a rule for divisors > 2**24, I was thrown off track by artifacts caused by bad FMA emulation. Will switch to the GPU (hardware FMA). – njuffa Feb 27 '16 at 06:55
  • What is the motivation / reason behind the bit-reversal computation for estimating the failure rate? – njuffa Feb 27 '16 at 08:34
  • @njuffa: I've rewritten my answer with (hopefully) correct names and terms, including a preamble to ensure you and other readers know what I am referring to. – Nominal Animal Feb 27 '16 at 12:56
  • Your graphs are great! (would upvote again). However I find your computation of `zl` for Brisebarre et al's method surprising. In intuitive terms, `zl` should be the next bits of the mathematical reciprocal 1/Y, and it seems to me that the way you are computing it is off by a factor of Y or something. The way I computed `zh` and `zl`, since I am blessed with a compiler that gives me 80-bit `long double` values, was `zh = 1.0f / C; zl = 1.0L / C - zh;`. **(Note: I am reading the description of that algorithm from section 5.5 of the Handbook of floating-point arithmetic, not from article)** – Pascal Cuoq Feb 27 '16 at 14:23
  • A great, freely available, extended description of the Brisebarre-Muller method is at http://perso.ens-lyon.fr/jean-michel.muller/Journal_MultFMAC_Final.pdf – Pascal Cuoq Feb 27 '16 at 15:00
  • @PascalCuoq: Thanks for the heads up! Yes, `zl` was missing the division by `y`. [Kahan](https://www.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps) (A Fused MAC also speeds up a grubby "Doubled-double" ... exactly without roundoff) describes the fma method for the remainder; and an earlier Brisebarre et al version of their article explicitly mentions it as a precise way to get the remainder. Indeed, if you check e.g. divisor 16777214, the `zl` computed via 80-bit long double yields 48 errors, but the fma method `zl` only 46 errors across all nonnegative finite float arguments. – Nominal Animal Feb 27 '16 at 15:59
  • @PascalCuoq: Even though it is terrible code, you (and anybody else interested) might wish to take a look at [my verification program](http://nominal-animal.net/answers/divisors.c) I used to derive these results (for filtering and bit reverse I used awk). The program is Public Domain / CC0. If nothing else, it might yield additional ideas, or you could point out if I've made some stupid errors. Probably have.. :) – Nominal Animal Feb 27 '16 at 16:09
  • @NominalAnimal Very nice work. BTW, it seems to me (based on my re-newed test framework), that there are other divisors > 2**24, besides those that satisfy the "odd after downscaling" test, that deliver correct results with Markstein's algorithm across all possible dividends. Do you see this as well, and is there a way to describe these additional divisors? I am still looking at the new data, probably won't be able to study it in detail the next couple of days. – njuffa Feb 27 '16 at 16:31
  • @njuffa: Half of the even divisors between 8388608 and 16777216 have only one argument (dividend) for which Markstein fails (worth special-casing?). Doing explicit search for these, yes: I can find examples -- like 14707506 = 0x1.c0d664p+23 -- where the rounding just happens to go right, so that Markstein actually succeeds for all finite float dividends for such divisors! In such cases, Markstein succeeds for higher powers of two, too -- 0x1.c0d664p+24, 0x1.c0d664p+25, ..., 0x1.c0d664p+125 (but not the largest, +126 and +127, which fail for 128 or 127 cases, probably due to underflow). Hmm. – Nominal Animal Feb 27 '16 at 17:57
  • Your loops `while (curr.u[0] < 0x7F800000U)` do a varying amount of unnecessary work depending on the value of `divisor`. As soon as no underflow or overflow happens during the check of a binade, the results obtained for that binade are exactly 2^k times the results that would have been obtained for any other binade in which no underflow or overflow happens (this applies to every method being considered here: naive, Markstein, Brisebarre-Muller, as well as to the reference correctly rounded division itself). – Pascal Cuoq Feb 27 '16 at 22:10
  • @PascalCuoq: I see what you mean. However, I don't see how to incorporate it into the vectorized approach, and the exact error counts are *really* useful for me. Maybe changing the indexing? I think memory bandwidth precludes taking `curr` from an array: too slow. In any case, it'd get pretty darn complicated, fast. I'm not sure it would be worth the effort, especially with my poor math-fu.. keeping it simple keeps it verifiable, too. – Nominal Animal Feb 28 '16 at 04:11
  • @njuffa: I added a much, much more accurate Markstein failure case count estimator (number of arguments Markstein will yield an incorrect answer, for a given divisor). In my testing (admittedly limited; divisors 1 to 20000 are verified, the rest randomly sampled) it yields the correct number of failure cases to within ±1 (actually, -0+1). – Nominal Animal Feb 28 '16 at 04:48

I love @Pascal's answer but in optimization it's often better to have a simple and well-understood subset of transformations rather than a perfect solution.

All current and common historical floating point formats had one thing in common: a binary mantissa.

Therefore, all fractions were rational numbers of the form:

x / 2^n

This is in contrast to the constants in the program (and all possible base-10 fractions) which are rational numbers of the form:

x / (2^n * 5^m)

So, one optimization would simply test the input and reciprocal for m == 0, since those numbers are represented exactly in the FP format and operations with them should produce numbers that are accurate within the format.

So, for example, within the (decimal 2-digit) range of .01 to 0.99 dividing or multiplying by the following numbers would be optimized:

.25 .50 .75

And everything else would not. (I think, do test it first, lol.)
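To make the m == 0 test concrete, here is a small sketch (the helpers are mine, not from the answer): it checks whether a decimal constant num / 10^digits is of the form x / 2^n, i.e. exactly representable in a binary floating-point format with enough significand bits. Over the two-digit range .01 to .99 it flags exactly .25, .50 and .75.

    #include <stdbool.h>

    static unsigned long gcd_ul (unsigned long a, unsigned long b)
    {
        while (b) { unsigned long t = a % b; a = b; b = t; }
        return a;
    }

    /* true if num / 10^digits reduces to x / 2^n (no factor of 5 remains) */
    static bool is_dyadic_decimal (unsigned long num, unsigned int digits)
    {
        unsigned long den = 1;
        while (digits--)
            den *= 10;                          /* denominator = 10^digits */
        den /= gcd_ul (num == 0 ? den : num, den);
        return (den & (den - 1)) == 0;          /* only factors of 2 left? */
    }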

DigitalRoss
  • Note that the question already restricts the divisors to be considered to *integers*, since I figured addressing the question of arbitrary divisors would be way too hard. Thus the divisors considered are all exactly representable as `float` (up to 2**24). However, empirically, the only integer divisors for which I have shown the proposed code to work are of the form *2x+1* and *2\*\*n*. And even that is conjecture so far since I can't test them all (I keep my test app running to generate a white list). – njuffa Feb 22 '16 at 19:39
  • I do not want the question to get side-tracked into generalizations. But as a side note, clearly there are many more divisors other than odd integers and powers of two for which the code in the question delivers correct quotients across all possible dividends. For example, if I search in increments of `0.5f`, I get the following partial list: `PASS: 1, 1.5, 2, 2.5, 3, 4, 5, 5.5, 6.5, 7, 8, 9, 9.5, 10.5, 11, 13, 13.5, 14.5, 15, 16, 17, 17.5, 18.5, 19, 21, 21.5, 22.5, 23, 25, 25.5, 26.5, 27, 29, ` – njuffa Feb 22 '16 at 23:25
  • All of those numbers actually have a precise FP representation in a few bits, without repeating patterns, so by extension to my exact argument, they could be expected to work. But you have a point about sidetracking. Think of my answer as grist for some other mill. Not exactly your answer but maybe an answer for someone else. – DigitalRoss Feb 23 '16 at 20:51
  • @njuffa Sorry if I'm particularly thick here, but what is the importance of odd integer divisors specifically? Any non-zero `float` can be turned into an odd integer by.... drum roll.... scaling by an appropriate power of 2. So if you prove your FMA-based division works for all odd integers, and you know that bitwise-correct division by powers of 2 can be done easily, then you've proved the FMA algorithm works for all `floats`. – Iwillnotexist Idonotexist Feb 26 '16 at 05:37
  • @Iwillnotexist Idonotexist You are likely assuming that dividing by a power of two is an exact operation, but that is unfortunately not always so. When the result is a denormal, rounding may occur. This is why the code I posted in the question works for division by `3.0f`, but not for division by `6.0f`. You may now ask: why not use FTZ mode and avoid denormals? That makes the code fail since the computed residual suddenly underflows to zero. You can easily check for yourself for which divisors the code sequence will work by running the test app included with the question. – njuffa Feb 26 '16 at 05:49
  • @njuffa Then why not normalize the numerator s.t. it's in the range `[1,2)`, do FMA division using a scaled denominator as I propose, then undoing both scalings? – Iwillnotexist Idonotexist Feb 26 '16 at 16:47
  • @Iwillnotexist Idonotexist Not sure what you have in mind. Consider writing an answer (SO likes Q and A, not discussion). The test framework needed for testing is available in the question, you can just plug in whatever code sequence you have in mind. – njuffa Feb 26 '16 at 16:54

The result of a floating point division is:

  • a sign flag
  • a significand
  • an exponent
  • a set of flags (overflow, underflow, inexact, etc. - see <fenv.h>)

Getting the first 3 pieces correct (but the set of flags incorrect) is not enough. Without further knowledge (e.g. which parts of which pieces of the result actually matter, the possible values of the dividend, etc) I would assume that replacing division by a constant with multiplication by a constant (and/or a convoluted FMA mess) is almost never safe.

In addition, for modern CPUs I also wouldn't assume that replacing a division with 2 FMAs is always an improvement. For example, if the bottleneck is instruction fetch/decode, then this "optimisation" would make performance worse. For another example, if subsequent instructions don't depend on the result (the CPU can do many other instructions in parallel while waiting for the result), the FMA version may introduce multiple dependency stalls and make performance worse. For a third example, if all registers are being used then the FMA version (which requires additional "live" variables) may increase "spilling" and make performance worse.

Note that (in many but not all cases) division or multiplication by a constant power of 2 can be done with addition alone (specifically, adding a shift count to the exponent), as long as the result neither overflows nor becomes subnormal.
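A minimal sketch of that exponent trick, using the portable ldexpf rather than raw bit manipulation (exact only while the result stays in the normal range, as the comments on the question point out):

    #include <math.h>

    /* x / 2^n computed by adjusting the exponent */
    static float div_by_pow2 (float x, int n)
    {
        return ldexpf (x, -n);
    }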

Brendan
  • The question is tagged “C”. A C program that accesses the floating-point status flags without an explicit `#pragma STDC FENV_ACCESS ON` beforehand should not expect the results to be correct, so the compiler knows exactly when it has to preserve the flags and when it does not have to. The remark that makes the first half of your answer applies (or does not apply, in most cases) to optimizations as elementary as constant propagation. (C11 7.6.1:2) – Pascal Cuoq Feb 22 '16 at 12:23
  • As for the cost of division, on all processors I am aware of that support FMA in hardware, division is significantly more costly than two FMAs (or even five FMAs). On platforms with floating-point division in hardware, the proposed optimization can increase register pressure *slightly* but so can many other optimizations such as CSE or early load scheduling. On platforms that perform floating-point division in software the proposed code will very likely *lower* register pressure, as a general purpose IEEE-compliant division routine could easily require ten live registers at the "widest" point – njuffa Feb 22 '16 at 21:57