
In C++ (presume at least C++11), given a floating point value a, I need to find the floating point value b satisfying the following constraints:

  • b must have the same sign as a.
  • The magnitude of b must be less-or-equal [*] to the magnitude of a.
  • The magnitude of b must be a power of 2.
  • The magnitude of b must be as large as possible within these constraints.

In other words, I need to "truncate" the magnitude of a to the nearest power of 2, while leaving the sign unchanged.

[* The "less-or-equal" constraint is lax in my case, and "less" would also work.]

Given some IEEE 754 binary representation, one approach to achieving this would be to simply clear all mantissa bits via bit bashing, while leaving the sign and exponent bits unchanged.
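
For illustration, here is roughly what I have in mind (a sketch only, assuming IEEE 754 binary64 and a 64-bit unsigned integer of matching byte order; the function name is made up, and std::memcpy is used to sidestep aliasing concerns):

#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "sketch assumes 64-bit double");

// Sketch only: clears the 52 mantissa bits, keeping sign and exponent.
// Note that this turns NaNs into infinities and collapses sub-normals to (signed) zero.
double Pow2TruncBits(double a)
{
    std::uint64_t bits;
    std::memcpy(&bits, &a, sizeof bits);   // reinterpret the bit pattern
    bits &= 0xFFF0000000000000u;           // keep sign + exponent, clear all mantissa bits
    std::memcpy(&a, &bits, sizeof bits);
    return a;
}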

A more portable approach would be to:

  1. Get the base-2 logarithm of the magnitude, rounded down, using e.g. logb, ilogb, or, even more portably, log2 or frexp.
  2. Raise 2 to the n-th power (where n is the result of step 1) using e.g. integer bit shifting (beware negative powers and value range issues), pow(2.0,n), exp2(n), or ldexp(1.0,n).
  3. Copy the sign via copysign.
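
For example, one such combination might look like this (just a sketch; the function name is made up, and special values such as 0, infinities and NaN simply fall through to whatever the underlying functions produce):

#include <cmath>

// One possible combination, sketched: floor(log2) via logb, 2^n via exp2,
// sign restored via copysign.
double Pow2TruncPortable(double a)
{
    return std::copysign(std::exp2(std::logb(a)), a);
}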

This allows for a lot of possible combinations to solve the task, even more so when also considering the single-precision alternatives. Does anyone have any experience with these approaches regarding performance on modern hardware and using modern compilers?

Christoph Lipka
  • Remember that C and C++ are two *very* different languages, where even things that might be similar could have minute differences in wording in their respective specifications, making things work very differently from what might be expected. So unless you want to compare the two languages directly then please don't use both tags, and never use the term "C/C++". – Some programmer dude Feb 09 '19 at 22:49
  • I'd be surprised if C and C++ behave much differently performance-wise with respect to these functions, so I'd expect answers from both C and C++ programmers to be relevant. But here you go - no more C, just C++. – Christoph Lipka Feb 09 '19 at 23:29
  • @drescherjm I'd be interested in any observations you can share about the performance of the individual building blocks I mentioned, or combos thereof. For instance, to my amazement I'm just finding out that on my machine `ilogb(float)` is slower than `ilogbf(float)`, which in turn is slower still than converting to double and invoking `ilogb(double)`. Which makes no sense to me, but I can't deny the experimental evidence. – Christoph Lipka Feb 10 '19 at 01:52

2 Answers


Use frexp()¹ and ldexp()² to decompose a and form the answer.

These two functions are nearly all that is needed.

The frexp functions break a floating-point number into a normalized fraction and an integral power of 2. ... frexp functions return the value x, such that x has a magnitude in the interval [1/2, 1) or zero.

The ldexp functions multiply a floating-point number by an integral power of 2.

#include <float.h>
#include <math.h>
#include <stdio.h>

double round_pow2(double a) {
  int exp;
  double frac = frexp(a, &exp);
  if (frac > 0.0) frac = 0.5;
  else if (frac < 0.0) frac = -0.5;
  double b = ldexp(frac, exp);

  printf("% 20g % 25a % 25a", a, a, b);
  printf(" %d", !!signbit(a) == !!signbit(b)); // b must have the same sign as a.
  printf(" %d\n", !(fabs(b) > fabs(a)));       // magnitude `b` must be <= magnitude `a`.

  return b;
}

Test code

void round_pow2_test(double x) {
  round_pow2(nextafter(-x, -INFINITY));
  round_pow2(-x);
  round_pow2(nextafter(-x, INFINITY));
  round_pow2(nextafter(x, -INFINITY));
  round_pow2(x);
  round_pow2(nextafter(x, INFINITY));
}

int main(void) {
  round_pow2_test(0);
  round_pow2_test(DBL_MIN);
  round_pow2_test(1.0);
  round_pow2_test(42.0);
  round_pow2_test(DBL_MAX);
  round_pow2(NAN);
  return 0;
}

Output

   -4.94066e-324                -0x1p-1074                -0x1p-1074 1 1
              -0                   -0x0p+0                   -0x0p+0 1 1
    4.94066e-324                 0x1p-1074                 0x1p-1074 1 1
   -4.94066e-324                -0x1p-1074                -0x1p-1074 1 1
               0                    0x0p+0                    0x0p+0 1 1
    4.94066e-324                 0x1p-1074                 0x1p-1074 1 1
   -2.22507e-308  -0x1.0000000000001p-1022                -0x1p-1022 1 1
   -2.22507e-308                -0x1p-1022                -0x1p-1022 1 1
   -2.22507e-308  -0x1.ffffffffffffep-1023                -0x1p-1023 1 1
    2.22507e-308   0x1.ffffffffffffep-1023                 0x1p-1023 1 1
    2.22507e-308                 0x1p-1022                 0x1p-1022 1 1
    2.22507e-308   0x1.0000000000001p-1022                 0x1p-1022 1 1
              -1     -0x1.0000000000001p+0                   -0x1p+0 1 1
              -1                   -0x1p+0                   -0x1p+0 1 1
              -1     -0x1.fffffffffffffp-1                   -0x1p-1 1 1
               1      0x1.fffffffffffffp-1                    0x1p-1 1 1
               1                    0x1p+0                    0x1p+0 1 1
               1      0x1.0000000000001p+0                    0x1p+0 1 1
             -42     -0x1.5000000000001p+5                   -0x1p+5 1 1
             -42                 -0x1.5p+5                   -0x1p+5 1 1
             -42     -0x1.4ffffffffffffp+5                   -0x1p+5 1 1
              42      0x1.4ffffffffffffp+5                    0x1p+5 1 1
              42                  0x1.5p+5                    0x1p+5 1 1
              42      0x1.5000000000001p+5                    0x1p+5 1 1
            -inf                      -inf                   -0x1p-1 1 1
   -1.79769e+308  -0x1.fffffffffffffp+1023                -0x1p+1023 1 1
   -1.79769e+308  -0x1.ffffffffffffep+1023                -0x1p+1023 1 1
    1.79769e+308   0x1.ffffffffffffep+1023                 0x1p+1023 1 1
    1.79769e+308   0x1.fffffffffffffp+1023                 0x1p+1023 1 1
             inf                       inf                    0x1p-1 1 1
             nan                       nan                       nan 1 1

¹ From OP's "Get the base-2 logarithm of the magnitude, rounded down, using e.g. ... frexp."

² From OP's "Raise 2 to the n-th power using e.g. ... ldexp(1.0,n)."

chux - Reinstate Monica
  • I'm aware of this. But do you have experience how it compares with the other possibilities performance-wise? (For instance, naively I would assume bit bashing to be an even faster approach for the truncate case, requiring just a single AND instruction - but then I have no idea how data transfer between AVX and main CPU registers might counteract that advantage.) – Christoph Lipka Feb 09 '19 at 23:38
  • @ChristophLipka "Bit bashing" can be faster with a subset of `double`, yet _certainly_ slower when the entire range of `double` must be handled correctly (AND will not do it), as the "bit bashing" user code then replicates the library `frexp, ldexp`. On a select platform and library a `round2power2(x)` may already exist - but then that is not portable per the standard library. If you truly want something "faster", you need to code the performance rating tool (average over all x, worst case, ...), the subset of `double` if that is allowed, and the target platform/compiler. The above is your baseline. – chux - Reinstate Monica Feb 09 '19 at 23:49
  • @ChristophLipka Take care about posting a moving-target question. Once answers arrive, it is poor SO etiquette to change the goals (languages, general to platform-specific, etc.). It is good to elaborate and provide missing detail. Be mindful that the things others are aware of are based on the question and not on unwritten/unspecified intentions. – chux - Reinstate Monica Feb 09 '19 at 23:57
  • @ChristophLipka Note: "simply clear all mantissa bits" fails for sub-normals and NAN. Those need other handling. Depending on how bits are cleared for other values, code can run into aliasing issues and certainly portability issues. You could code up a solution **and** test harness and then ask for improvements: "how to make faster given certain conditions" - perhaps on here or [Code Review](https://codereview.stackexchange.com/) – chux - Reinstate Monica Feb 10 '19 at 00:04
  • @ChristophLipka Lastly, my decades of experience point out that optimization in narrow cases is far too often [premature optimization](https://en.wikipedia.org/wiki/Program_optimization#When_to_optimize). Far better to 1) start with a portable, full-range working solution and then 2) use "tricks" only when demonstrably necessary. Better to use my programmer skills to improve overall system efficiency than to chase minor linear improvements such as the mask suggested here. – chux - Reinstate Monica Feb 10 '19 at 00:13
  • "it is poor SO etiquette to change the goals" No goals have been changed, except that I dropped reference to C at the suggestion of another SO user (superior in reputation). Question has _explicitly_ been about performance, not how-to, right from the start; I just retroactively applied emphasis after getting the impression that some readers were taking a TL;DR approach to reading questions, and missing this key detail. – Christoph Lipka Feb 10 '19 at 00:37
  • @ChristophLipka: I am not surprised that chux was annoyed with you! After such a comprehensive and thoughtful answer, you say "I'm aware of this." Well why didn't you say so to start with? What a waste of chux's time! You should be apologising. – TonyK Feb 10 '19 at 01:32
  • @TonyK Why? Because they failed to read my question to the end and thus their answer totally missed the point? (:shrugs:) – Christoph Lipka Feb 10 '19 at 02:37
  • @ChristophLipka The question was read in its entirety and had various conflicting, unclear issues. As the author, it may have been clear to you, yet the initial "I need to find the floating point value b satisfying the following constraints:" appeared paramount and this answer focused on that, not the trailing "Does anyone have any experience with these approaches regarding performance on modern hardware and using modern compilers?". The trailing question's direct answer is: "yes, I do". Best to put the most important aspect of your question first, not last. – chux - Reinstate Monica Feb 10 '19 at 03:55
  • @chux You mean, like, in the subject of the question? – Christoph Lipka Feb 10 '19 at 04:16
  • @ChristophLipka Yes. The question lacked any performance benchmark details and did not provide a baseline. To rate performance, one needs a baseline, as provided here. To consider alternatives, the range of floating-point values (assumed full range) and the platform/compiler (assumed generic, C99-compliant) need to be specified. For example, even to [double a FP](https://stackoverflow.com/q/54610832/2410359) value takes a fair amount of tests for the full range of a floating-point type. "one approach to achieving this would be to simply clear all mantissa bits via bit bashing," is insufficient for this question's goal. – chux - Reinstate Monica Feb 10 '19 at 04:26
  • Technically, there is no solution if `a` is zero, since no `b` satisfies the constraints of the problem. If we do not care what is returned when `a` is zero, then `frac = copysign(.5, frac);` suffices and might perform better than the `if…else if…`. – Eric Postpischil Feb 10 '19 at 05:04
  • Since I'm after a relative comparison between potentially vastly different approaches, I don't see the benefit of specifying a particular approach as a baseline. If anything at all, I'd expect it to discourage people from posting their personal experience with approaches X and Y because I gave Z as the baseline, which they happen to not have any experience with. Specifying a particular platform or compiler would also be counter-productive, as the project I'm working on is supposed to be portable (but choosing between alternative optimized implementations at compile time is an option). – Christoph Lipka Feb 10 '19 at 05:13
  • @EricPostpischil From the tests I'm currently conducting, it seems that `ldexp((a<0?-0.5:0.5),exp)` will be faster than `ldexp(copysign(0.5,a),exp)`, and on par with `b=ldexp(0.5,exp); if(a<0) b=-b;` -- on my machine anyway. Which makes sense I think, because `copysign` must deal with arbitrary mantissas and exponents, while in this case the mantissa and exponent are fixed. – Christoph Lipka Feb 10 '19 at 05:31

From my own tests, I'm coming to the following conclusions so far (but since I don't have a test lab at my disposal, my observational evidence is limited, and the jury is still out):

  • It is pretty irrelevant whether operations are performed in the single or double precision domain. As a matter of fact, most functions involved seem to perform slightly faster in their double precision incarnation, even when this requires additional conversions.

  • When operating on single-precision values, the functions without the f suffix (e.g. ilogb) should be avoided, as they generally perform worse than their f-suffixed counterparts (e.g. ilogbf).

  • "bit bashing" is unrivaled in terms of performance. Surprisingly, this also performs better in the 64-bit domain (then again, I'm testing on a 64-bit machine). I'm seeing less than 1 ns per execution. By comparison, my "testbed" itself weighs in at about 15 ns per iteration.

As for implementations of the "pow2(floor(log2))" approach, here's what I'm concluding so far:

  • I don't see any special combination of the basic building blocks that would give a performance boost from unexpected synergy effects, so it seems reasonable to consider the types of building blocks ("pow2", "floor(log2)" and sign fix) separately.

  • Presuming the 0.0 case is of little concern, the fastest way to handle sign is to essentially do a "pow2(floor(log2(abs)))" operation, then fix the sign afterwards with a simple if(a<0) b=-b; this is about 5 ns faster than copysign. If the "pow2" building block has a mantissa-like factor (like ldexp does), using a comparison to choose between a positive or negative factor is also a viable option, being only slightly slower than the post-operation conditional fix.

  • By far the worst choice for the "pow2" operation (and one which the software I'm working on has been using for ages in two implementations) is to naively use pow(2.0,x). While a compiler could conceivably optimize it into something much faster, mine doesn't. exp2 is about 60 ns faster. ldexp is another 15 ns faster still, making it the best choice, weighing in at a guesstimated 8-10 ns.

  • There is an even faster option (also used in the software I'm working on), namely using bit shifts in the integer domain, but it comes at the cost of severely restricting the range of values for which the function works (see the sketch after this list). If this road is to be ventured, the operation should be performed in the long long domain, as it's only marginally slower than in the int domain. This approach may save another 4-5 ns.

  • The slowest "floor(log2)" building block I could find (aside from (int)(log(x)/log(2)), which I didn't even bother to test) was (int)log2(fabs(x)) and its kin. frexp is about 30 ns faster, weighing in at a guesstimated 8-10 ns.

  • If the floating-point type uses a base-2 representation, ilogb is a viable alternative to frexp and saves another 1 ns. logb is slightly slower than ilogb (on par with frexp), which makes sense I guess.
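
For illustration, the integer-bit-shift variant of the "pow2" building block mentioned above could be sketched like this (the name is made up; it only covers exponents from 0 to 63, i.e. results that fit in an unsigned long long):

// Sketch: 2^n via an integer shift, then converted to double. Only valid
// for 0 <= n <= 63; negative or larger exponents need a different path
// (e.g. ldexp).
double Pow2ViaShift(int n)
{
    return static_cast<double>(1ull << n);
}

Combined with e.g. ilogb and the post-operation sign fix, this yields the restricted-range variant referred to above.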


All in all, so far the following implementations seem worth considering:

#include <cstdint>

double Pow2Trunc(double a)
{
    // Type punning through a union: well-defined in C, formally undefined
    // behaviour in C++, but widely supported; assumes IEEE 754 binary64.
    union { double f; uint64_t i; } hack;
    hack.f = a;
    hack.i &= 0xFFF0000000000000u;   // keep sign and exponent bits, clear the mantissa
    return hack.f;
}

being the fastest implementation (ca. 1 ns), provided special values are of no concern, the floating-point binary format is known (in this case IEEE binary64), and an integer type of the same size and byte ordering is available;

#include <cmath>

double Pow2Trunc(double a)
{
    int exp;
    (void)frexp(a, &exp);        // only the binary exponent is needed
    double b = ldexp(0.5, exp);  // 0.5 * 2^exp, i.e. the truncated magnitude
    if (a < 0) b = -b;           // restore the sign
    return b;
}

being the fastest fully portable implementation (ca. 16 ns); and maybe

#include <cmath>

double Pow2Trunc(double a)
{
    double b = ldexp(1.0, ilogb(a));  // ilogb yields floor(log2(|a|)) for radix-2 types
    if (a < 0) b = -b;                // restore the sign
    return b;
}

being a slightly less portable but also slightly faster alternative (ca. 15 ns).

(Handling of special values can presumably be improved; for my use case however they do not matter enough to warrant further examination.)
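
(For completeness: if special values did matter, a simple wrapper along the following lines could forward them unchanged. The wrapper name is made up, I have not measured it, and Pow2Trunc stands for any of the variants above.)

#include <cmath>

double Pow2Trunc(double a);  // any of the implementations above

// Sketch: pass zeros, infinities and NaNs through unchanged and only
// apply the power-of-2 truncation to finite non-zero values.
double Pow2TruncChecked(double a)
{
    if (a == 0.0 || !std::isfinite(a))
        return a;
    return Pow2Trunc(a);
}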

Providing alternatives based on float does not seem to be worth the effort; if they are provided, it is important to use the f-suffixed variants of the functions.


Obviously these results are subject to hardware platform, compiler and settings (i7-5820K, Windows 10 Subsystem for Linux, g++ 5.4.0, -std=gnu++11 -O3 -ffast-math). Mileage in other environments may vary, and learning about cases where the results are qualitatively different would be most valuable to me.

Christoph Lipka