Considering this function,
float mulHalf(float x) {
return x * 0.5f;
}
the following function produces the same result with normal input/output.
float mulHalf_opt(float x) {
__m128i e = _mm_set1_epi32(-1 << 23);
__asm__ ("paddd\t%0, %1" : "+x"(x) : "xm"(e));
return x;
}
This is the assembly output with -O3 -ffast-math
.
mulHalf:
mulss xmm0, DWORD PTR .LC0[rip]
ret
mulHalf_opt:
paddd xmm0, XMMWORD PTR .LC1[rip]
ret
-ffast-math
enables -ffinite-math-only
which "assumes that arguments and results are not NaNs or +-Infs" [1].
So the compiled output of mulHalf
might better use paddd
with -ffast-math
on if doing so produces faster code under the tolerance of -ffast-math
.
I got the following tables from Intel Intrinsics Guide.
(MULSS)
Architecture Latency Throughput (CPI)
Skylake 4 0.5
Broadwell 3 0.5
Haswell 5 0.5
Ivy Bridge 5 1
(PADDD)
Architecture Latency Throughput (CPI)
Skylake 1 0.33
Broadwell 1 0.5
Haswell 1 0.5
Ivy Bridge 1 0.5
Clearly, paddd
is a faster instruction. Then I thought maybe it's because of the bypass delay between integer and floating-point units.
This answer shows a table from Agner Fog.
Processor Bypass delay, clock cycles
Intel Core 2 and earlier 1
Intel Nehalem 2
Intel Sandy Bridge and later 0-1
Intel Atom 0
AMD 2
VIA Nano 2-3
Seeing this, paddd
still seems like a winner, especially on CPUs later than Sandy Bridge, but specifying -march
for recent CPUs just change mulss
to vmulss
, which has a similar latency/throughput.
Why don't GCC and Clang optimize multiplication by 2^n with a float to paddd
even with -ffast-math
?