From my own tests, I'm coming to the following conclusions so far (but since I don't have a test lab at my disposal, my observational evidence is limited, and the jury is still out):
It is pretty irrelevant whether operations are performed in the single or double precision domain. As a matter of fact, most functions involved seem to perform slightly faster in their double precision incarnation, even when this requires additional conversions.
For single precision work, the functions without the f suffix (e.g. ilogb) should be avoided, as they generally perform worse than their f-suffixed counterparts (e.g. ilogbf).
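For instance (a sketch of the difference; the function names are just labels, and both compute the same thing for positive a, but only the second stays in single precision throughout):

#include <math.h>

float ViaDoubleFuncs(float a) { return (float)ldexp(1.0, ilogb((double)a)); }  // double versions plus conversions
float ViaFloatFuncs(float a)  { return ldexpf(1.0f, ilogbf(a)); }              // f-suffixed versions throughout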
"bit bashing" is unrivaled in terms of performance. Surprisingly, this also performs better in the 64-bit domain (then again, I'm testing on a 64-bit machine). I'm seeing less than 1 ns per execution. By comparison, my "testbed" itself weighs in at about 15 ns per iteration.
As for implementations of the "pow2(floor(log2))" approach, here's what I'm concluding so far:
I don't see any special combination of the basic building blocks that would give a performance boost from unexpected synergy effects, so it seems reasonable to consider the types of building blocks ("pow2", "floor(log2)" and sign fix) separately.
Presuming the 0.0 case is of little concern, the fastest way to handle the sign is to perform a "pow2(floor(log2(abs)))" operation and then fix the sign with a simple if (a < 0) b = -b;, which is about 5 ns faster than copysign. If the "pow2" building block takes a mantissa-like factor (as ldexp does), using a comparison to choose between a positive and a negative factor is also viable, being only slightly slower than the post-operation conditional fix. (Both options are sketched below.)
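Spelled out, the options look roughly like this (a sketch built on frexp/ldexp; the function names are just labels):

#include <cmath>

// Option 1: compute on the magnitude, fix the sign afterwards
double SignFixAfter(double a)
{
    int exp;
    (void)frexp(a, &exp);       // frexp's exponent ignores the sign of a
    double b = ldexp(0.5, exp);
    if (a < 0) b = -b;          // simple post-operation conditional fix
    return b;
}

// Option 2: pick a positive or negative mantissa-like factor up front
double SignedFactor(double a)
{
    int exp;
    (void)frexp(a, &exp);
    return ldexp(a < 0 ? -0.5 : 0.5, exp);
}

// For comparison: the copysign variant (about 5 ns slower per the timings above)
double ViaCopysign(double a)
{
    int exp;
    (void)frexp(a, &exp);
    return copysign(ldexp(0.5, exp), a);
}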
By far the worst choice for the "pow2" operation (and one which the software I'm working on has been using for ages in two implementations) is to naively use pow(2.0, x). While a compiler could conceivably optimize it into something much faster, mine doesn't. exp2 is about 60 ns faster. ldexp is another 15 ns faster still, making it the best choice, weighing in at a guesstimated 8-10 ns.
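Side by side (the function names are just labels for the three candidates, taking the already-computed exponent e):

#include <cmath>

double Pow2Pow(int e)   { return pow(2.0, e); }    // worst: a general power function for a fixed base
double Pow2Exp2(int e)  { return exp2(e); }        // about 60 ns faster
double Pow2Ldexp(int e) { return ldexp(1.0, e); }  // another ~15 ns faster: scales 1.0 by 2^e directly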
There is an even faster option (also used in the software I'm working on), namely using bit shifts in the integer domain, but it comes at the cost of severely restricting the range of values for which the function works. If this road is to be ventured, the operation should be performed in the long long domain, as it's only marginally slower than in the int domain. This approach may save another 4-5 ns.
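The integer-domain variant looks like this (a sketch; the shift is only defined for exponents 0 through 62 here, which is the range restriction mentioned above):

double Pow2Shift(int e)
{
    return (double)(1LL << e);  // long long domain; undefined for e outside [0, 62]
}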
The slowest "floor(log2)" building block I could find (aside from (int)(log(x)/log(2)), which I didn't even bother to test) was (int)log2(fabs(x)) and its kin. frexp is about 30 ns faster, weighing in at a guesstimated 8-10 ns.
If the floating-point type uses a base-2 representation, ilogb is a viable alternative to frexp and saves another 1 ns. logb is slightly slower than ilogb (on par with frexp), which makes sense I guess.
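Side by side, those "floor(log2)" candidates (a sketch; the function names are just labels, and the ilogb/logb variants assume FLT_RADIX == 2):

#include <cmath>

int FloorLog2Log2(double x)  { return (int)log2(fabs(x)); }  // slowest variant in my tests

int FloorLog2Frexp(double x)                                 // about 30 ns faster
{
    int exp;
    (void)frexp(x, &exp);
    return exp - 1;  // frexp's mantissa lies in [0.5,1), so its exponent is one too high
}

int FloorLog2Ilogb(double x) { return ilogb(x); }            // another ~1 ns faster; base-2 only
int FloorLog2Logb(double x)  { return (int)logb(x); }        // slightly slower than ilogb, on par with frexp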
All in all, so far the following implementations seem worth considering:
#include <cstdint>

double Pow2Trunc(double a)
{
    union { double f; uint64_t i; } hack;
    hack.f = a;
    hack.i &= 0xFFF0000000000000u;  // keep the sign and exponent bits, zero out the mantissa
    return hack.f;
}
being the fastest implementation (ca. 1 ns), provided special values are of no concern, the float binary format is known (in this case IEEE binary64), and an integer type of the same size and byte ordering is available (on the strictness of the union trick, see the note below the listings);
#include <cmath>

double Pow2Trunc(double a)
{
    int exp;
    (void)frexp(a, &exp);        // only the exponent is needed; |a| == mantissa * 2^exp, mantissa in [0.5,1)
    double b = ldexp(0.5, exp);  // 0.5 * 2^exp == 2^(exp-1) == pow2(floor(log2(abs(a))))
    if (a < 0) b = -b;           // post-operation sign fix
    return b;
}
being the fastest fully portable implementation (ca. 16 ns); and maybe
#include <cmath>

double Pow2Trunc(double a)
{
    double b = ldexp(1.0, ilogb(a));  // ilogb ignores the sign; assumes a base-2 representation
    if (a < 0) b = -b;                // post-operation sign fix
    return b;
}
being a slightly less portable but also slightly faster alternative (ca. 15 ns).
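(A caveat on the first implementation: reading a union member other than the one last written is formally undefined behavior in C++, although g++ documents it as a supported extension. If that is a concern, a memcpy-based version expresses the same bit manipulation in well-defined terms, and compilers generally emit the same code for it:)

#include <cstdint>
#include <cstring>

double Pow2Trunc(double a)
{
    uint64_t i;
    std::memcpy(&i, &a, sizeof i);  // well-defined alternative to the union trick
    i &= 0xFFF0000000000000u;       // keep sign and exponent, zero the mantissa
    std::memcpy(&a, &i, sizeof a);
    return a;
}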
(Handling of special values can presumably be improved; for my use case however they do not matter enough to warrant further examination.)
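For illustration, a quick sanity check of what any of the three implementations should produce on ordinary inputs (not part of the benchmark):

#include <cassert>

double Pow2Trunc(double a);  // any of the implementations above

int main()
{
    assert(Pow2Trunc(5.5)  ==  4.0);   // 2^floor(log2(5.5)) == 2^2
    assert(Pow2Trunc(8.0)  ==  8.0);   // exact powers of two are unchanged
    assert(Pow2Trunc(-3.0) == -2.0);   // the sign is preserved
    assert(Pow2Trunc(0.25) ==  0.25);  // magnitudes below 1.0 work as well
    return 0;
}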
Providing alternatives based on float does not seem to be worth the effort; if they are provided, it is important to use the f-suffixed variants of the functions.
Obviously these results are subject to hardware platform, compiler and settings (i7-5820K, Windows 10 Subsystem for Linux, g++ 5.4.0, -std=gnu++11 -O3 -ffast-math). Other environments' mileage may vary, and learning about cases where the results are qualitatively different would be most valuable to me.
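For anyone wanting to reproduce or refute this, a minimal sketch of the kind of timing loop involved (assuming std::chrono; my actual testbed differs in the details):

#include <chrono>
#include <cmath>
#include <cstdio>

static double Pow2Trunc(double a)  // stand-in for the implementation under test
{
    return ldexp(1.0, ilogb(a));
}

int main()
{
    const long N = 100000000L;
    volatile double sink = 0.0;  // volatile keeps the optimizer from removing the loop
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i)
        sink = Pow2Trunc((double)(i + 1));  // varying input defeats constant folding
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    std::printf("%.2f ns/iteration (loop overhead included)\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / N);
    return 0;
}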