2

Let's say I want to divide an unsigned int by 2 or 4 or 8, etc. AFAIK the compiler replaces such division with a shift.
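For example, with unsigned operands the two functions below compute the same thing, and a compiler will typically emit a single shift for both (a minimal sketch, assuming a 32-bit unsigned int):

unsigned int div_by_8(unsigned int x){
    return x / 8;   // typically compiled to the same code as x >> 3
}

unsigned int shift_by_3(unsigned int x){
    return x >> 3;  // identical result for unsigned values
}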

But can I expect that, instead of dividing a float by 128, the compiler subtracts 7 from its exponent field?

What is the best practice to ensure that exponent subtraction is used instead of floating-point division?

Vasily
  • I'd be interested in hearing what you think it *does* do in that case. I'm pretty certain it's not carrying out repeated subtraction :-) I think you'll find the developers (of both your hardware and your runtime libraries) have optimised this stuff as much as they can. – paxdiablo Nov 16 '19 at 06:07
  • @paxdiablo, the difference is that I'm not that certain. For example, it only works for an integer if it is positive. If the integer is negative, shifts don't work. The same can be true for float. If the compiler can't exclude some rare cases where this subtraction somehow fails, it can just divide. – Vasily Nov 16 '19 at 06:16
  • 2
    You will find that the FPU hardware does exactly this, and any attempt you make to second-guess it will only slow it down. – user207421 Nov 16 '19 at 06:32
  • This sounds like micro-optimization. In most cases, it's not worth the effort. There might be very specific cases, where lots of `float` values always have to be divided by exactly 128. (And, it's granted that the mantissa is normalized.) And, your algorithm fails the required time limit by a few nanoseconds... Though, you should've mentioned that in your question, in this case. ;-) – Scheff's Cat Nov 16 '19 at 07:32
  • 1
    If you divide `float` or `double` by the constant `128`, this division will be replaced by a multiplication. – Evg Nov 16 '19 at 07:55
  • (Near) duplicates: [Why doesn't a compiler optimize floating-point \*2 into an exponent increment?](https://stackoverflow.com/q/12919184) / [Why don't GCC and Clang optimize multiplication by 2^n with a float to integer PADDD of the exponent, even with -ffast-math?](https://stackoverflow.com/q/72367940) – Peter Cordes Jun 05 '23 at 01:50

2 Answers

4

If you are multiplying or dividing by a constant, a compiler of modest quality should optimize it. On many platforms, a hardware multiply instruction may be optimal.

For multiplying (or dividing) by a power of two, `std::ldexp(x, p)` multiplies x by 2^p, where p is an int (and divides if p is negated). I would not expect much benefit over simple multiplication on most platforms, as manual (software) exponent manipulation must include checks for overflow and underflow, so the resulting sequence of instructions is not likely to improve over a hardware multiply in most situations.
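For illustration, a minimal sketch of that approach in C, using ldexpf, the float counterpart from <math.h> (std::ldexp is the C++ spelling):

#include <math.h>
#include <stdio.h>

int main(void){
    float x = 3.14f;
    // ldexpf(x, p) computes x * 2^p, with overflow/underflow handled
    float divided    = ldexpf(x, -7);  // x / 128
    float multiplied = ldexpf(x, 7);   // x * 128
    printf("%f %f\n", divided, multiplied);
    return 0;
}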

Eric Postpischil
  • Not accounting for the occasional bugs to be expected in those underflow checks, both with MSVC https://stackoverflow.com/questions/32150888/should-ldexp-round-correctly and with Apple's 64-bit libm https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/383 – aka.nice Nov 16 '19 at 20:12
  • Well, but at least this is the function that's supposed to do exactly what I need. I didn't know such a thing existed. – Vasily Nov 17 '19 at 04:34
2

TL;DR

Use the default division with the `/` operator.

Long Answer

I wrote a half-working bitwise implementation of multiplying/dividing a float by powers of 2 (it doesn't account for NaN, infinity, and 0, as Peter Cordes pointed out), but found that it still performed slightly worse than the native `/` with a power-of-2 divisor, albeit better than division by a non-power-of-2.

This suggests that GCC performs some sort of optimization on division by powers of 2, which we can confirm by looking at the x86-64 assembly it generates on Godbolt.

For powers of 2, 1/2.0f = 0.5f is exactly representable without introducing any rounding error, so n/2.0f is exactly equivalent to n * 0.5f. GCC knows it's safe to make that optimization even without -ffast-math: binary floating-point uses a base-2 exponent (mantissa * 2^exp), so a power-of-2 value (including negative powers like 0.5 = 2^-1) can be represented exactly with a mantissa of 1.0.
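As a quick illustration of that equivalence (a sketch; both functions compile to the same mulss on x86-64):

float half_div(float n){ return n / 2.0f; }  // GCC turns this division ...
float half_mul(float n){ return n * 0.5f; }  // ... into this exact multiplication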

#include "stdio.h"

union float_int{
    unsigned int i;
    float f;
};

float float_pow2(float n, int pow){
    union float_int u;
    u.f = n;
    unsigned int exp = ((u.i>>23)&0xff) + pow;
    u.i = (u.i&0b10000000011111111111111111111111) | (exp<<23);
    return u.f;
}

int main(){
    float n = 3.14;
    float result = 0;
    for(int i = 0; i < 1000000000; i++){
        // Uncomment one of the four
        // result += n/2.01f;
        // result += n/2.0f;
        // result += n/2;
        result += float_pow2(n,-1);
        
        // Prevent value re-use
        // n == 100003.14 by the time the loop ends, and the result will be in float range
        n += 0.01f;
    }
    printf("%f\n", result);
}

Performance

The code was compiled with GCC -O3. Without optimization, the compiler will not inline float_pow2 and will store/reload after every statement, so the custom function's performance would be even worse, because it is written as multiple statements.

Builtin division performance with 2.01f as divisor (x86 divss)

real    0m1.907s
user    0m1.896s
sys 0m0.000s

Builtin division performance with 2.0f as divisor (x86 mulss)

real    0m0.798s
user    0m0.791s
sys 0m0.004s

Builtin division performance with 2 as divisor (x86 mulss)

real    0m0.798s
user    0m0.794s
sys 0m0.004s

Custom division performance (float_pow2)

(GCC copies the data to an integer register and back, instead of using SSE2 vector-integer math instructions.)

real    0m0.968s
user    0m0.967s
sys 0m0.000s

About the accuracy: the standard deviation on the last test was 0.018 seconds across the ten tests performed; the others fall in similar ranges of consistency.
The performance of 2.0f and 2.0 was almost the same, and indeed they compile to the same assembly, according to godbolt.org.


Performance analysis of what the benchmark is actually measuring (this section written by @Peter Cordes):

The benchmark measures latency of float addition, or total throughput cost of the loop body if that's higher. (e.g. if the compiler can't optimize division into multiplication, see Floating point division vs floating point multiplication).

Or with float_pow2, it's complicated on Intel CPUs: https://uica.uops.info/ 's predictions for Skylake and Ice Lake for GCC's asm loops (copy/pasted from Godbolt) are pretty close to the measured results: about 4.9 cycles per iteration for the float_pow2 loop vs. 4.0c for the /= 2 (aka *= 0.5f) loop. That predicted 4.9c / 4c ≈ 1.22 performance ratio is very close to the measured .968s / .798s ≈ 1.21.

Its analysis shows that it's not a throughput bottleneck like divss was (the execution units could run that amount of work in 2.25 cycles per iteration if they didn't have to wait for inputs to be ready). And the two addss dependency chains alone are in theory still 4 cycles long each. (Skylake has 2/clock FP add/mul throughput, with 4-cycle latency.)

But in some cycles when an addss was ready to execute, it had to wait a cycle because a different uop was the oldest one ready for that execution port. (Probably because movd eax, xmm0 and addss xmm0, xmm2 are both waiting for the same XMM0 input: the result of the addss xmm0, xmm2 from the previous iteration, i.e. the n += 0.01f. Some of those addss xmm0, xmm2 uops get scheduled to port 0, where they run into this resource conflict, delaying progress on the critical-path dependency chain, the n += 0.01f.)

So what we're really measuring here are the resource conflicts created by the extra work of float_pow2 interfering with two FP-addition latency bottlenecks. If an addition doesn't run as soon as its input is ready, there's no way to make up that lost time, because this is a latency bottleneck, not a throughput one. Unrolling with multiple accumulators (n1 += 0.02f / n2 += 0.02f, and so on) can avoid that, but compilers can't do it without -ffast-math because it can introduce different rounding error.
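For reference, a minimal sketch of that kind of manual unrolling (a hypothetical rewrite of the benchmark loop; the two chains run in parallel, but the sums combine in a different order, so the rounding can differ):

#include <stdio.h>

int main(void){
    float n1 = 3.14f, n2 = 3.15f;   // two independent dependency chains
    float r1 = 0, r2 = 0;
    for(int i = 0; i < 500000000; i++){
        r1 += n1 / 2.0f;            // chain 1
        r2 += n2 / 2.0f;            // chain 2 overlaps chain 1's add latency
        n1 += 0.02f;
        n2 += 0.02f;
    }
    printf("%f\n", r1 + r2);        // combined only at the end
    return 0;
}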

mulss, being only a single instruction, could in theory create the same bottleneck, but in this case the uop scheduling tends to work out so it doesn't steal cycles from the critical path.

BTW, the dependency pattern of two add chains connected by a multiply (or some other operations) is the same as in [Latency bounds and throughput bounds for processors for operations that must occur in sequence](https://stackoverflow.com/q/63095394).

Peter Cordes
martian17
  • Your `float_pow2` isn't checking for `0.0` is it? Or infinities or NaN. So `float_pow2(0.0f, 10)` will be `FLT_MIN * 2^9` instead of the correct `0.0 * 1024 == 0`, as well as munging NaNs and infinities. And it's still not faster. – Peter Cordes Jun 05 '23 at 01:42
  • But wait a minute, with `result += float_pow2(n,-1);` the compiler can optimize it to `result += loop_invariant`, hoisting the work out of the loop. So you're maybe just benchmarking FP addition latency. `float_pow2` isn't part of the dependency chain so you're not testing the extra bypass-forwarding latency between integer and FP operations if the compiler actually uses SSE2 `paddd` with `n<<23` or something. (Although probably all the mainstream compilers would `movd` to an integer register and back.) – Peter Cordes Jun 05 '23 at 01:48
  • 1
    https://godbolt.org/z/b319Ks7vo shows the compiler-generated asm from GCC and clang. They both unroll the loop some (but not with multiple accumulators because we didn't use `-ffast-math`, so it doesn't change the bottleneck on FP-add latency.) Anyway, yeah, the asm for `result += float_pow2(n,-1);` is identical to that for `result += n/2;`, just adding the same FP constant that it calculated at compile time. If you want to use something like Google::Benchmark, its `DoNotOptimize` helper function could work to force the compiler to forget about the value of a variable and have to compute it. – Peter Cordes Jun 05 '23 at 05:49
  • Your answer still incorrectly claims you measured the performance of your "custom division" function. Assuming you used GCC or clang (or any half-decent compiler), you didn't. You just timed FP addition latency, and `float_pow2` didn't even run once, evaluated at compile-time. Even for compilers that don't constant-propagate through it, they should still realize the value is loop-invariant. – Peter Cordes Jun 05 '23 at 07:53
  • Thanks for the input, and I've corrected my answer to the best of my ability, including the value re-use (I was in the middle of editing, sorry for the confusion and inconsistency). Please point out any errors if there still are any. My methods may have been flawed, but the performance difference between dividing by 2 and by 2.01 was clear and consistent, so the conclusion should still be that `/` should be used above everything else. – martian17 Jun 05 '23 at 08:05
  • 1
    Your benchmark does at least run it now, but the speed of your loops are mostly limited by FP addition latency, which is enough to almost cover the throughput cost of the `float_pow2` integer work, as in [this Q&A with an exercise from CS:APP](https://stackoverflow.com/q/63095394). (Or by FP division throughput with a divisor that doesn't have an exact inverse to let the compiler replace it with `* (1.0f / divisor)` like it can for `/ 2.0` being exactly equivalent to `* 0.5`. And also possible FP conversion to/from `double` since you used `2.01` not `2.01f`. – Peter Cordes Jun 05 '23 at 08:26
  • 1
    https://godbolt.org/z/55q4dqMPq puts the latency of `float_pow2` in the critical path. (Multiplying with `inf` or `nan` doesn't slow down SSE FPUs, unlike legacy x87, so it's ok that it eventually overflows.) – Peter Cordes Jun 05 '23 at 08:26
  • I ran your code from godbolt, and the performance seems to vary significantly depending on whether the floating-point value `x` is overflowing or underflowing. I'm not convinced that operating on `inf` or `nan` is equivalent in performance to operating on a normal floating-point value. – martian17 Jun 05 '23 at 09:21
  • 1
    Perhaps there's a subnormal involved somehow, requiring an FP assist? But you're subtracting 1 from the exponent field, not adding. I'll play with this tomorrow, maybe trying with `2.0` so it doesn't overflow. Note that Godbolt runs on AWS instances that have different CPU speeds and different competing loads each time. I'm pretty confident `mulss` isn't slower when one or both inputs is Inf or NaN on modern Intel hardware. – Peter Cordes Jun 05 '23 at 09:26
  • 1
    I tested on my own system (i7-6700k Skylake at 3.9GHz). The latency benchmark I constructed for `float_pow2` (https://godbolt.org/z/j8e7dWe1M) runs in the same time (2.837 sec +- a couple ms) for `x *= 2.0f` (so the final result is `1.250000` and GCC happened not to optimize away the benchmark), vs. `x *= 2.01f` (so we reach `inf` pretty early) or `x *= 2.00001f` (so we reach inf a bit later). – Peter Cordes Jun 05 '23 at 14:47
  • 1
    Under `perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_assist.any ./a.out`, I can see `0` counts for `fp_assist.any`. That means there were no data-dependent slowdown in FP math, just the fixed 4-cycle latency of `mulss`. If I use `long double x` (which is the 80-bit x87 type since I'm on Linux not Windows), a value that stays finite ends up taking 4.4 sec. A `2.01` multiplier takes 130.9 sec (extrapolated from 100x fewer iters), with a count of `9,982,256` for `fp_assist.any`. – Peter Cordes Jun 05 '23 at 14:52
  • 1
    If you're not seeing stable / consistent timings when you run the program on your system, maybe check [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Jun 05 '23 at 14:53
  • 1
    So my actual benchmark results: (https://godbolt.org/z/cee59anrb with updated plain-C things to compare against) timing the latency of `x = float_pow2(x, -1)` plus `x *= 2.0001f` at about 2.83 sec, vs. 2.05 sec for `x /= 2.0f; x *= 2.0001f;`. And throughput should also be better for `x /= 2.0f` but I didn't measure that. – Peter Cordes Jun 05 '23 at 15:01
  • 1
    Your current benchmark bottlenecks on whichever is worse: FP addition latency, or throughput of `divss` or `float_pow2`. (With `result += n/2` or 2.0 or 2.0f, the throughput of multiplying by `0.5` (up to two per clock cycle) is not a bottleneck; with two additions per FP-add latency (4 cycles on Skylake), the FP execution units are idle for most of the time. That's why `float_pow2` only times a little bit slower than `n/2`: the addition latency bottleneck was so much slower than `mulss` throughput that most of `float_pow2`'s throughput cost is still "hidden" in the shadow of the latency.) – Peter Cordes Jun 05 '23 at 16:16
  • 1
    Actually it's not `float_pow2` throughput that's a bottleneck; I was realizing that didn't make sense (since it's not that many operations) so I had a look. I added a section to your answer about exactly what your benchmark is measuring and why the result is what it is. It ended up being a very big edit, and if you don't want to keep all of it, that's understandable. I can put it somewhere else or just mention in comments that it's in the edit history. But if you don't mind leaving it in your answer, I think that's a good place for it. (It's still not an ideal benchmark, though!) – Peter Cordes Jun 05 '23 at 17:02