TL;DR
Use the default division with the / operator.
Long Answer
I wrote a half-working bitwise implementation of float multiplication/division by powers of 2 (it doesn't account for NaN, Infinity and 0, as Peter Cordes pointed out), but found that it still performed slightly worse than the native / with a power-of-2 divisor, albeit better than non-power-of-2 division.
This suggests that GCC performs some sort of optimization on power-of-2 division, which we can confirm by looking at the x86-64 assembly it generates on Godbolt.
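For example, compiling a pair of one-liners like these (hypothetical names, not part of the benchmark) at -O3 makes the difference visible: the power-of-2 divide becomes a mulss by 0.5f, while the 2.01f divide stays a divss, because 1/2.01 has no exact float representation.

float div_pow2(float x) { return x / 2.0f; }  /* GCC -O3: mulss by 0.5f              */
float div_201(float x)  { return x / 2.01f; } /* GCC -O3: divss, no safe reciprocal  */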
For powers of 2, 1/2.0f = 0.5f is exactly representable without introducing any rounding error, so n/2.0f is exactly equivalent to n * 0.5f, and GCC knows it's safe to make that optimization even without -ffast-math. Binary floating-point uses a base-2 exponent, storing values as mantissa * 2^exp, so a power-of-2 value (including negative powers like 0.5 = 2^-1) can be represented exactly with a 1.0 mantissa.
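As a concrete example, 0.5f is stored with sign 0, a biased exponent of 126 (that is, -1 + 127), and an all-zero mantissa field, giving the bit pattern 0x3F000000. The bitwise implementation and the benchmark loop: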
#include <stdio.h>

/* Type-pun between the bits of a float and a 32-bit unsigned integer. */
union float_int{
    unsigned int i;
    float f;
};

/* Multiply n by 2^pow by adding pow directly to the 8-bit biased exponent
   field (bits 30..23).  Does not handle NaN, Infinity, 0, or subnormals. */
float float_pow2(float n, int pow){
    union float_int u;
    u.f = n;
    unsigned int exp = ((u.i>>23)&0xff) + pow;
    /* Keep the sign and mantissa bits, splice in the adjusted exponent. */
    u.i = (u.i&0b10000000011111111111111111111111) | (exp<<23);
    return u.f;
}

int main(){
    float n = 3.14;
    float result = 0;
    for(int i = 0; i < 1000000000; i++){
        // Uncomment one of the four
        // result += n/2.01f;
        // result += n/2.0f;
        // result += n/2;
        result += float_pow2(n,-1);

        // Prevent value re-use
        // n == 100003.14 by the time the loop ends, and the result will be in float range
        n += 0.01f;
    }
    printf("%f\n", result);
}
Performance
The code was compiled with GCC -O3. Without optimization, the compiler will not inline float_pow2 and will store/reload after every statement, so the custom function would perform even worse, since it is spread over multiple statements.
Builtin division performance with 2.01f as divisor (x86 divss)
real 0m1.907s
user 0m1.896s
sys 0m0.000s
Builtin division performance with 2.0f as divisor (x86 mulss)
real 0m0.798s
user 0m0.791s
sys 0m0.004s
Builtin division performance with 2 as divisor (x86 mulss)
real 0m0.798s
user 0m0.794s
sys 0m0.004s
Custom division performance (float_pow2)
(GCC copies the data to an integer register and back, instead of using SSE2 vector-integer math instructions.)
real 0m0.968s
user 0m0.967s
sys 0m0.000s
As for measurement accuracy, the standard deviation of the last test was 0.018 seconds across ten runs; the other tests showed similar consistency.
The performance with 2.0f and with 2 as the divisor was almost the same, and indeed they compile to the same assembly according to godbolt.org.
Performance analysis of what the benchmark is actually measuring (this section was written by @Peter Cordes):
The benchmark measures the latency of float addition, or the total throughput cost of the loop body if that's higher (e.g. if the compiler can't optimize the division into a multiplication; see Floating point division vs floating point multiplication).
Or with float_pow2, it's complicated on Intel CPUs: https://uica.uops.info/ 's predictions for Skylake and Ice Lake for GCC's asm loops (copy/pasted from Godbolt) are pretty close to the measured results: about 4.9 cycles per iteration for the float_pow2 loop vs. 4.0 cycles for the /= 2 (aka *= 0.5f) loop. That 4.9c / 4.0c ≈ 1.22 performance ratio is very close to 0.968s / 0.798s ≈ 1.21.
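(As a sanity check on the absolute numbers: if the /= 2 loop really runs at 4.0 cycles per iteration, 10^9 iterations in 0.798 s works out to a clock of roughly 5 GHz, so the wall-clock times are at least consistent with the cycle predictions, assuming the CPU held a steady boost clock in that range.)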
uiCA's analysis shows that it's not a throughput bottleneck like divss was (the execution units could run that amount of work in 2.25 cycles per iteration if they didn't have to wait for inputs to be ready).
And the two addss dependency chains alone are in theory still 4 cycles long each. (Skylake has 2/clock FP add/mul throughput, with 4-cycle latency.)
But in some cycles when an addss was ready to execute, it had to wait a cycle because a different uop was the oldest one ready for that execution port. (Probably because movd eax, xmm0 and addss xmm0, xmm2 are both waiting for the same XMM0 input, the result of the addss xmm0, xmm2 from the previous iteration, i.e. the n += 0.01f. Some of those addss xmm0, xmm2 uops get scheduled to port 0, where they run into this resource conflict, which delays progress on the critical-path dependency chain, the n += 0.01f.)
So what we're really measuring here are the resource conflicts created by the extra work of float_pow2 interfering with two FP-addition latency bottlenecks. If an addition doesn't run as soon as its input is ready, there's no way to make up that lost time, because this is a latency bottleneck, not a throughput bottleneck. Unrolling with multiple accumulators (n1 += 0.02f / n2 += 0.02f and so on) can avoid that, but compilers can't do it without -ffast-math because it can introduce different rounding error.
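A minimal sketch of what such manual unrolling could look like (hypothetical variable names; as noted above, the rounding, and therefore the printed result, can differ from the single-accumulator loop):

#include <stdio.h>

int main(){
    /* Two independent accumulator/counter chains, so each addss depends
       only on a value produced two iterations earlier. */
    float n1 = 3.14f, n2 = 3.14f + 0.01f;
    float result1 = 0, result2 = 0;
    for(int i = 0; i < 1000000000; i += 2){
        result1 += n1 * 0.5f;
        result2 += n2 * 0.5f;
        n1 += 0.02f;
        n2 += 0.02f;
    }
    printf("%f\n", result1 + result2);
}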
mulss being only a single instruction could in theory create the same bottleneck, but in this case the uop scheduling tends to work out so it doesn't steal cycles from the critical path.
BTW, the dependency pattern of two add chains connected by a multiply (or some other operation) is the same as the one discussed in Latency bounds and throughput bounds for processors for operations that must occur in sequence.