
In one of our last assignments in Computer Science this term we have to apply strength reduction to some code fragments. Most of them were straightforward, especially when looking at the compiler output. But there is one I have not been able to solve, even with the help of the compiler.

Our profs gave us the following hint:

Hint: Inquire how IEEE 754 single-precision floating-point numbers are represented in memory.

Here is the code snippet (a is of type double*):

for (int i = 0; i < N; ++i) {
    a[i] += i / 5.3;   
}

At first I looked at the compiler output for this snippet on Godbolt. I compiled it without any optimization (note: I copied only the relevant part of the for loop):

    mov     eax, DWORD PTR [rbp-4]
    cdqe
    lea     rdx, [0+rax*8]
    mov     rax, QWORD PTR [rbp-16]
    add     rax, rdx
    movsd   xmm1, QWORD PTR [rax]
    cvtsi2sd        xmm0, DWORD PTR [rbp-4]    //division relevant
    movsd   xmm2, QWORD PTR .LC0[rip]          //division relevant
    divsd   xmm0, xmm2                         //division relevant
    mov     eax, DWORD PTR [rbp-4]
    cdqe
    lea     rdx, [0+rax*8]
    mov     rax, QWORD PTR [rbp-16]
    add     rax, rdx
    addsd   xmm0, xmm1
    movsd   QWORD PTR [rax], xmm0

and with -O3:

.L2:
        pshufd  xmm0, xmm2, 238             //division relevant
        cvtdq2pd        xmm1, xmm2          //division relevant
        movupd  xmm6, XMMWORD PTR [rax]
        add     rax, 32
        cvtdq2pd        xmm0, xmm0          //division relevant
        divpd   xmm1, xmm3                  //division relevant
        movupd  xmm5, XMMWORD PTR [rax-16]
        paddd   xmm2, xmm4
        divpd   xmm0, xmm3                  //division relevant
        addpd   xmm1, xmm6
        movups  XMMWORD PTR [rax-32], xmm1
        addpd   xmm0, xmm5
        movups  XMMWORD PTR [rax-16], xmm0
        cmp     rax, rbp
        jne     .L2

I commented the division-relevant parts of the assembly code. But this output does not help me understand how to apply strength reduction to the snippet. (Maybe there are too many optimizations going on to fully understand the output.)

Second, I tried to understand the bit representation of the constant 5.3 as an IEEE 754 single-precision float, which is:

0   |10000001|01010011001100110011010
Sign|Exponent|Mantissa
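
A minimal snippet to reproduce this bit pattern (assuming float is a 32-bit IEEE 754 type; memcpy avoids the undefined behavior of pointer type-punning):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main (void)
{
    float f = 5.3f;
    uint32_t bits;
    memcpy (&bits, &f, sizeof bits);  // copy the object representation
    printf ("0x%08x\n", bits);        // prints 0x40a9999a
    return 0;
}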

But this does not help me either.

Zagatho
  • @TonyK I already inquired that, but I do not think that is the correct way. The result of `1.0/5.3` is a very odd number. I even checked the bit representation. – Zagatho Jun 16 '22 at 19:44
  • The only use for strength reduction that I can see here is the computation of the address of `a[i]`. This is a weird assignment! – TonyK Jun 16 '22 at 19:50
  • Some people would consider multiplication to be a "less strong" (expensive) instruction than division, but `a[i] += i * (1.0/5.3)` would still be computing separately for each `a[i]`. It would *enable* simple strength reduction of multiplication to addition, or better just contracting the expression to one FMA on hardware that supports it. See [Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?](https://stackoverflow.com/q/72306573) (including my answer) re: naive strength reduction creating a big performance problem due to FP latency – Peter Cordes Jun 17 '22 at 09:05
  • 1
    Of course, that multiplicative inverse wouldn't be precise because `1/5.3` isn't exactly representable. (Nor is 5.3, but after rounding to the nearest double, the inverse of that also isn't exactly representable.) I'd assume that a strength-reduction on FP operations is going to introduce different rounding error, perhaps even the possibility of error accumulating across iterations. – Peter Cordes Jun 17 '22 at 09:08
  • Your example on Godbolt has compile-time-visible UB, dereferencing an uninitialized `double *a`. See [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) - write a function that takes an arg. https://godbolt.org/z/Eba4cY1Ws. (`-O3 -ffast-math` lets the compiler multiply by the reciprocal.) It's a fairly straightforward auto-vectorization using `i++` with a vector of 4 `int` elements, which it unpacks and converts to `double` each time, for `mulpd` or `divpd`, to feed an `addpd`.) Clang makes simpler asm if you tell it not to unroll, just 2 elems. – Peter Cordes Jun 18 '22 at 00:46
  • I don't see how the binary format of IEEE floating point is relevant here, though. If you eventually find out what answer was expected which does take advantage of that, I'd be curious. – Peter Cordes Jun 18 '22 at 00:47
  • I will update my question as soon as I get the result of the task – Zagatho Jun 18 '22 at 12:00
  • Gentle reminder about the promised update :) – amonakov Jul 19 '22 at 17:58

3 Answers


CAVEAT: After a few days I realized that this answer is incorrect in that it ignores the consequence of underflow (to subnormal or to zero) in the computation of o / 5.3. In this case, multiplying the result by a power of two is “exact” but does not produce the result that dividing the larger integer by 5.3 would have.

i / 5.3 only needs to be computed for odd values of i. For even values of i, you can simply multiply the value of (i/2)/5.3, which was already computed earlier in the loop, by 2.0.

The remaining difficulty is to reorder the iterations in a way such that each index between 0 and N-1 is handled exactly once and the program does not need to record an arbitrary number of division results.

One way to achieve this is to iterate on all odd numbers o less than N, and after computing o / 5.3 in order to handle index o, to also handle all indexes of the form o * 2**p.

if (N > 0) {
  a[0] += 0.0; // this is needed for strict IEEE 754 compliance lol
  for (int o = 1; o < N; o += 2) {
    double d = o / 5.3;
    int i = o;
    do { 
      a[i] += d;
      i += i;
      d += d;
    } while (i < N);
  }
}

Note: this does not use the provided hint “Inquire how IEEE 754 single-precision floating-point numbers are represented in memory”. I think I know pretty well how single-precision floating-point numbers are represented in memory, but I do not see how that is relevant here, especially since there are no single-precision values or computations in the code to optimize. I think there is a mistake in the way the problem is expressed, but the above is still technically a partial answer to the question as phrased.

I also ignored overflow problems for values of N that come close to INT_MAX in the code snippet above, since the code is already complicated enough.

As an additional note, the above transformation only replaces one division out of two, and it does so at the cost of making the code unvectorizable (and also less cache-friendly). In your question, gcc -O3 has already shown that automatic vectorization can be applied to the starting point that your professor provided, and that is likely to be more beneficial than eliminating half of the divisions. The only good thing about the transformation in this answer is that it is a sort of strength reduction, which your professor requested.

Pascal Cuoq
  • 1
    Handling for a[0] is not correct, should be '+= 0' instead of '= 0'. Any reason you're using multiplications in the inner loop without strength-reducing them to additions? :) – amonakov Jun 17 '22 at 08:02
  • @amonakov I took into account your first remark, and I **think** I did the second one, but it is getting quite complicated now, does it look correct? – Pascal Cuoq Jun 17 '22 at 09:12
  • 1
    My comment was a bit tongue-in-cheek; I think you can double d instead of computing p*d, yielding: for (int i = o; i < N; d += d, i += i) a[i] += d; but this chains fp additions; you code looks correct to me. – amonakov Jun 17 '22 at 10:03
  • The cache access pattern of this is pretty bad if the whole array doesn't fit in L1d or at least L2 cache. You're looping over it multiple times. And much worse, at large varying strides, so usually you're only modifying one double in any cache line you touch. The non-constant stride probably defeats hardware prefetching, although the `i` address calculations are simple so OoO exec can have plenty of those in the pipeline. But L2 prefetch won't see the pattern to try to run ahead of the demand loads. – Peter Cordes Jun 17 '22 at 10:14
  • So for N>(256KiB/8B) (exceeding L2 cache size on typical CPUs), I'd expect naive division to be faster than this on a CPU with fast `divpd` (like Skylake with 2 doubles per 4 clocks `vdivpd` throughput whether that's with XMM or YMM), since the original `a[i] += i/5.3` loop is pretty trivial to auto-vectorize. (SIMD integer increment, convert to FP, divide, pure-vertical, compilers will do this for you.) – Peter Cordes Jun 17 '22 at 10:18
  • Older CPUs, like Haswell from almost a decade ago, have slower packed-double division, like 2 doubles per 16 clocks, so the break-even point might be for larger arrays depending on L3 cache / DRAM vs. divide throughput. – Peter Cordes Jun 17 '22 at 10:21
  • @PeterCordes Some optimizations make the execution time worse. The question was about strength reduction and I answered as best I could about applying strength reduction to the provided example. I am not saying that I expect the transformed code to perform better in practice, and I am not stating otherwise either because it is not really part of the question. – Pascal Cuoq Jun 17 '22 at 11:18
  • I'd at least have mentioned that the access pattern is a likely performance problem, as well as preventing SIMD, despite potentially solving the abstract algorithm / math problem that was stated. I feel like that's an important disclaimer. I know the question didn't ask for a speedup, but it's a good idea to warn future readers not to blindly use this. – Peter Cordes Jun 17 '22 at 11:23
  • @PeterCordes Done. I'm almost tempted to apply strength reduction to the floating-point multiplication now, since you forced me to be principled about it, even though that suggestion by amonakov was only tongue-in-cheek. – Pascal Cuoq Jun 17 '22 at 11:33

If we adopt Wikipedia's definition that

strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations

then we can apply strength reduction here by converting the expensive floating-point division into a floating-point multiply plus two floating-point multiply-adds (FMAs). Assuming that double is mapped to IEEE-754 binary64, that the default rounding mode of round-to-nearest-even is in effect, and that int is a 32-bit type, we can prove the transformation correct by a simple exhaustive test:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>

int main (void)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
        if (res != ref) {
            printf ("error: i=%2d  res=%23.13a  ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}

Most modern instances of common processor architectures like x86-64 and ARM64 have hardware support for FMA, such that fma() can be mapped directly to the appropriate hardware instruction. This should be confirmed by looking at the disassembly of the generated binary. Where hardware support for FMA is lacking, the transformation obviously should not be applied, as software implementations of fma() are slow and sometimes functionally incorrect.

The basic idea here is that, mathematically, division is equivalent to multiplication by the reciprocal. However, that is not necessarily true for finite-precision floating-point arithmetic. The code above tries to improve the likelihood of bit-accurate computation by determining the error in the naive approach with the help of FMA and applying a correction where necessary. For background, including literature references, see this earlier question.
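
For readability, the nested fma() calls can be decomposed into named steps (a sketch; the helper name and the temporaries are mine):

#include <math.h>

double div_5p3 (int i)  // computes i / 5.3 as in the test above
{
    const double rcp_5p3 = 1.0 / 5.3;       // RN(1/5.3)
    double q0 = i * rcp_5p3;                // naive quotient
    double r  = fma (-5.3, q0, (double)i);  // residual i - 5.3*q0, single rounding
    return fma (r, rcp_5p3, q0);            // corrected quotient q0 + r * (1/5.3)
}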

To the best of my knowledge, there is not yet a general mathematically proven algorithm to determine for which divisors paired with which dividends the above transformation is safe (that is, delivers bit-accurate results), which is why an exhaustive test is strictly necessary to show that the transformation is valid.

In comments, Pascal Cuoq points out that there is an alternative algorithm to potentially strength-reduce floating-point division with a compile-time constant divisor, by precomputing the reciprocal of the divisor to more than native precision, specifically as a double-double. For background see N. Brisebarre and J.-M. Muller, "Correctly rounded multiplication by arbitrary precision constants", IEEE Transactions on Computers, 57(2): 162-174, February 2008, which also provides guidance on how to determine whether that transformation is safe for any particular constant. Since the present case is simple, I again used an exhaustive test to show that it is safe. Where applicable, this reduces the division down to one FMA plus a multiply:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>

int main (void)
{
    const double rcp_5p3_hi =  1.8867924528301888e-1; //  0x1.826a439f656f2p-3
    const double rcp_5p3_lo = -7.2921377017921457e-18;// -0x1.0d084b1883f6e0p-57
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);
        if (res != ref) {
            printf ("i=%2d  res=%23.13a  ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}
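
The hi/lo split used above can be derived with the same FMA-residual trick (a sketch of mine, not part of the original answer; the computed lo term should be cross-checked, which the exhaustive test above effectively does):

#include <math.h>

void split_reciprocal (double d, double *hi, double *lo)
{
    *hi = 1.0 / d;                 // hi = RN(1/d)
    *lo = fma (-d, *hi, 1.0) / d;  // (1 - d*hi)/d approximates 1/d - hi
}
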
njuffa
  • 1
    Neat. You say there are some divisors where this does work exactly. Is that usually only for integer-valued dividends of limited range, like here? Or are there some divisors where this works for *every* dividend? (Other than power of 2 divisors where the original FP reciprocal is exact in the first place). It is practical to exhaustively test every 32-bit float bit-pattern so this is something to could be checked for any given divisor, at least for binary32. – Peter Cordes Jun 17 '22 at 10:00
  • @PeterCordes This method (best I know originally due to Markstein) works for many combinations of divisors and dividends, but certainly not for all, at all precisions. I do not know how to demonstrate validity for any particular combination other than by exhaustive test. When I last searched two years ago, I was not able to find anything in the literature about a-priori determination whether the transformation is safe or not. – njuffa Jun 17 '22 at 10:08
  • So nothing's known about which sets of dividends make it more or less likely for this to be exact for a divisor? Like the fact that 32-bit integer values all produce binary64 doubles with some trailing zeros in the mantissa; I wondered if that was important. – Peter Cordes Jun 17 '22 at 10:13
  • 1
    @PeterCordes Well, *I* do not know. There may be people in academia or industry that do know. Maybe the mathematicians in one of the French research teams active in the floating-point field. I follow the literature in fits and starts; every couple of years I trawl the internet for the latest developments. I have not looked at this particular issue of floating-point division by compile-time constant in a while; this question seemed like an interesting scenario to go and try whether it is applicable here (yes). – njuffa Jun 17 '22 at 10:22
  • Ok, that's fine, I just wanted to make sure you understood what I was asking. Thanks for sharing what you do know about this technique! If I ever get curious and try some tests myself (or maybe search the literature), I might check if there are any divisors where this works for every dividend, including non-integer ones, at least in some large range. – Peter Cordes Jun 17 '22 at 10:32
  • 1
    @PeterCordes Just in case you didn't know that that was available, section 5.5 in https://doc.lagout.org/science/0_Computer%20Science/3_Theory/Handbook%20of%20Floating%20Point%20Arithmetic.pdf seems to cover this. I would interpret the sentence “Method 3 is the most complex, but it always gives an answer (either it certifies, for a given C, that the algorithm will always return RN(C · x), or it returns all the counterexamples)” as meaning the question is solved. – Pascal Cuoq Jun 17 '22 at 11:49
  • 1
    @PascalCuoq: Thanks, I didn't know about that textbook. It says this method does typically work for *all* dividends, for most divisors. Not that it's more likely for integer-valued floats. They show a way to check other than exhaustive testing (which is impractical for binary64 or wider) which identifies which cases are potentially problematic. I haven't tried to figure out whether a number like 5.3 that isn't itself an exact integer or the reciprocal of one might make this method work for all integer-valued dividends even if it would fail for some floats, but that might well be the case. – Peter Cordes Jun 17 '22 at 12:04
  • 1
    @PascalCuoq I do have that book and read the paper by Brisebarre and Muller it references years ago. Thanks for the reminder. That is a *different* algorithm though, requiring the precomputation of the reciprocal to higher precision: `rcp_5p3_hi=1.8867924528301888e-1; rcp_5p3_lo=-7.2921377017921457e-18; res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);`. I will add that to the answer. – njuffa Jun 17 '22 at 17:40

To cover another aspect: since all values of type int are exactly representable as double (but not as float), it is possible to get rid of the int-to-double conversion that happens in the loop when evaluating i / 5.3 by introducing a floating-point variable that counts from 0.0 to N:

double fp_i = 0;
for (int i = 0; i < N; fp_i += 1, i++)
    a[i] += fp_i / 5.3;

However, this kills autovectorization and introduces a chain of dependent floating-point additions. Floating-point addition typically has a latency of 3 or 4 cycles, so the last iteration will retire after at least (N-1)*3 cycles, even if the CPU could dispatch the instructions in the loop faster. Thankfully, floating-point division is not fully pipelined, and the rate at which an x86 CPU can dispatch floating-point division roughly matches or exceeds the latency of the addition instruction, so the addition chain is not the bottleneck here.

This leaves the problem of killed vectorization. It's possible to bring it back by manually unrolling the loop and introducing two independent chains, but with AVX you'd need four chains for full vectorization:

double fp_i0 = 0, fp_i1 = 1;
int i = 0;
for (; i+1 < N; fp_i0 += 2, fp_i1 += 2, i += 2) {
    double t0 = a[i], t1 = a[i+1];
    a[i]   = t0 + fp_i0 / 5.3;
    a[i+1] = t1 + fp_i1 / 5.3;
}
if (i < N)
    a[i] += i / 5.3;
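
For illustration, a four-chain variant for AVX might look like this (my sketch along the same lines, untested for performance):

double f0 = 0, f1 = 1, f2 = 2, f3 = 3;
int i = 0;
for (; i + 3 < N; f0 += 4, f1 += 4, f2 += 4, f3 += 4, i += 4) {
    a[i]   += f0 / 5.3;
    a[i+1] += f1 / 5.3;
    a[i+2] += f2 / 5.3;
    a[i+3] += f3 / 5.3;
}
for (; i < N; i++)   // scalar cleanup for the remaining elements
    a[i] += i / 5.3;
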
amonakov