2

I have some places in my code where I want to assure that a division of 2 arbitrary floating point numbers (32 bit single precision) won't overflow. The target/compiler does not guarantee (explicitly enough) nice handling of -INF/INF and (does not fully guarantees IEEE 754 for the exceptional values - (possibly undefined) - and target might change). Also I cannot make save assumtions on the inputs for this few special places and I am bound to C90 standard libraries.

I have read What Every Computer Scientist Should Know About Floating-Point Arithmetic but to be honest, I am a little bit lost.

So... I want to ask the community, if the following piece of code would do the trick, and if there are better/faster/exacter/correcter ways to do it:

#define SIGN_F(val) ((val >= 0.0f)? 1.0f : -1.0f)

float32_t safedivf(float32_t num, float32_t denum)
{
   const float32_t abs_denum = fabs(denum);
   if((abs_denum < 1.0f) && ((abs_denum * FLT_MAX) <= (float32_t)fabs(num))
       return SIGN_F(denum) * SIGN_F(num) * FLT_MAX;
   else
       return num / denum;
}

Edit: Changed ((abs_denum * FLT_MAX) < (float32_t)fabs(num)) to ((abs_denum * FLT_MAX) <= (float32_t)fabs(num)) as recommeded by Pascal Cuoq.

Mark A.
  • 579
  • 4
  • 13
  • 1
    Goldberg's paper loses most people on a first reading, it is a dreadful resource for a beginner learning how to use f-p numbers and arithmetic. It is targeted at people designing software and hardware for implementing f-p numbers and algorithms. You would have been better starting at http://en.wikipedia.org/wiki/Floating_point – High Performance Mark Aug 14 '14 at 14:08
  • 1
    Surely better to ask forgiveness than permission – David Heffernan Aug 14 '14 at 14:10
  • 1
    `SIGN_F(val)` does not correctly extract the sign of `-0.0`, so `1.0 / -0.0` is going to end up as `+FLT_MAX`. You may or may not mind. If you aren't sure of IEEE 754, perhaps you cannot be sure of `copysign` either, but if it is there you might as well use it. – Pascal Cuoq Aug 14 '14 at 14:15
  • @PascalCuoq Ah your are right. Is there any chance to extract the sign of a value that is possibly -0.0f with the set of functions specified in C89/90 without trying do interfere with the floating point bit patterns? Correct me if I renember wrong: -0.0 compares exact like 0.0? – Mark A. Aug 14 '14 at 14:33
  • 1
    @MarkA. Yes, for `==`, `>`, etc., the signed zeroes are undistinguishable. You must 1- use `copysign`, or 2- access the representation through a `union`, or 3- divide by it and cause the very overflow that you are trying to avoid. That leaves you with solution 2-. – Pascal Cuoq Aug 14 '14 at 14:39
  • @PascalCuoq As I thought. I fear I will be harder to get guarantees on the bit patterns then on the exceptional value behavior. Thanks. – Mark A. Aug 14 '14 at 14:44
  • @DavidHeffernan I don't understand your comment. – Mark A. Aug 14 '14 at 15:02
  • 1
    Attempt to perform the divide. If it fails, detect that failure and react accordingly. Trying to predict the failure ahead of time is just as hard, in fact probably harder, than going ahead and performing the division. – David Heffernan Aug 14 '14 at 15:04
  • @DavidHeffernan Ah. Thanks for the input. I this special case, we don't trust the target very much to trap the error and that I have the possibility to recover from - as the specifications below the C-language level are a little bit in a flow. I have to assume that an overflow will lead to undefined bahavior, therefore I have to ask for permission. :-) – Mark A. Aug 14 '14 at 15:12
  • 1
    @Pascal Cuoq A good `atan2f(y,x)` distinguishes between `+0.0` and `-0.0` with `atan2f(zero, -1.0)` which will return `+pi` or `-pi` depending on `+0`, `-0`. – chux - Reinstate Monica Aug 14 '14 at 15:43
  • instead of defining `SIGN_F` you can use `signbit` or `copysign` [Is there a standard sign function (signum, sgn) in C/C++?](https://stackoverflow.com/q/1903954/995714) – phuclv Apr 09 '19 at 13:26

4 Answers4

4

You may try to extract the exponents and the mantissas of num and denum, and make sure that condition:

((exp(num) - exp (denum)) > max_exp) &&  (mantissa(num) >= mantissa(denum))

And according to the sign of the inputs, generate the corresponding INF.

Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
Garp
  • 109
  • 5
  • Not sure if that's the exactly right formula, but it's the right approach. – Hot Licks Aug 14 '14 at 18:18
  • I think it is correct as long as you use round to nearest mode. – Garp Aug 14 '14 at 18:34
  • 2
    If `exp(num) - exp(denum)` is significantly larger than `max_exp`, the values of the mantissa are irrelevant and overflow still occurs. Also the exponent of `(0.0)` is typically `0` and then `num == 0` or `denum == 0` need special handling. – chux - Reinstate Monica Aug 14 '14 at 18:59
1

In ((abs_denum * FLT_MAX) < (float32_t)fabs(num), the product abs_denum * FLT_MAX may round down and end up equal to fabs(num). This does not mean that num / denum is not mathematically larger than FLT_MAX, and you should be worried that it might happen to cause the overflow that you want to avoid. You had better replace this < by <=.


For an alternative solution, if a double type is available and is wider than float, it may be more economical to compute (double)num/(double)denum. If float is binary32ish and double is binary64ish, the only way the double division can overflow is if denum is (a) zero (and if denum is a zero your code is also problematic).

double dbl_res = (double)num/(double)denum;
float res = dbl_res < -FLT_MAX ? -FLT_MAX : dbl_res > FLT_MAX ? FLT_MAX : (float)dbl_res;
Pascal Cuoq
  • 79,187
  • 7
  • 161
  • 281
  • Can you please point out, where/if my (corrected) code is problematic in the case of 0.0, besides the sign-issue? – Mark A. Aug 14 '14 at 15:26
  • @MarkA. I was only referring to the sign issue. – Pascal Cuoq Aug 14 '14 at 15:30
  • Downvoter, any explanation why you find this answer particularly bad? – Pascal Cuoq Aug 18 '14 at 11:46
  • I am also surprised, because this solution - together with some -0.0 sign detector seemed to be the one to go for us. I can't see any further pitfalls, given that our floats and doubles hold the condition. But this can be assured by preprocessor. What is the reason for downvote? – Mark A. Aug 19 '14 at 05:48
  • Hm... Seems that there is nothing additional to say for the downvote... if you also haven't a idea in meanwhile why there might be a problem. I would go for your answer. – Mark A. Sep 06 '14 at 14:58
  • @MarkA. The downvote may come from someone in the “floating-point results are random” school of thought, to whom the idea of discussing the nuance between `<` and `<=`, instead of comparing “up to epsilon”, whatever that means in this case (add epsilon or subtract it? what value of epsilon?), might be anathema. – Pascal Cuoq Sep 06 '14 at 15:12
  • @MarkA. Why might I be right when I suggest a simple, non-epsilon-involving solution? I started working on the art of predicting what a C program using floating-point can do in 2009 (http://hisseo.saclay.inria.fr/index.html ), and I haven't really stopped since. – Pascal Cuoq Sep 06 '14 at 15:18
1

Carefully work with num, denom when the quotient is near FLT_MAX.

The following uses tests inspired by OP but stays away from results near FLT_MAX. As @Pascal Cuoq points out that rounding may just push the result over the edge. Instead it uses thresholds of FLT_MAX/FLT_RADIX and FLT_MAX*FLT_RADIX.

By scaling with FLT_RADIX, typically 2, code should always get exact results. Rounding under any rounding mode is not expected to infect the result.

In terms of speed, the "happy path", that is, when results certainly do not overflow should be a speedy calculation. Still need to do unit testing, but the comments should provide the gist of this approach.

static int SD_Sign(float x) {
  if (x > 0.0f)
    return 1;
  if (x < 0.0f)
    return -1;
  if (atan2f(x, -1.0f) > 0.0f)
    return 1;
  return -1;
}

static float SD_Overflow(float num, float denom) {
  return SD_Sign(num) * SD_Sign(denom) * FLT_MAX;
}

float safedivf(float num, float denom) {
  float abs_denom = fabsf(denom);
  // If |quotient| > |num|
  if (abs_denom < 1.0f) {
    float abs_num = fabsf(num);
    // If |num/denom| > FLT_MAX/2 --> quotient is very large or overflows
    // This computation is safe from rounding and overflow.
    if (abs_num > FLT_MAX / FLT_RADIX * abs_denom) {
      // If |num/denom| >= FLT_MAX*2 --> overflow
      // This also catches denom == 0.0
      if (abs_num / FLT_RADIX >= FLT_MAX * abs_denom) {
        return SD_Overflow(num, denom);
      }
      // At this point, quotient must be in or near range FLT_MAX/2 to FLT_MAX*2
      // Scale parameters so quotient is a FLT_RADIX * FLT_RADIX factor smaller.
      if (abs_num > 1.0) {
        abs_num /= FLT_RADIX * FLT_RADIX;
      } else {
        abs_denom *= FLT_RADIX * FLT_RADIX;
      }
      float quotient = abs_num / abs_denom;
      if (quotient > FLT_MAX / (FLT_RADIX * FLT_RADIX)) {
        return SD_Overflow(num, denom);
      }
    }
  }
  return num / denom;
}

The SIGN_F() needs to consider in denum is +0.0 or -0.0. Various methods mentioned by @Pascal Cuoq in a comment:

  1. copysign() or signbit()
  2. Use a union

Additional, some functions, when well implemented, differentiate on +/- zero like atan2f(zero, -1.0) and sprintf(buffer, "%+f", zero).

Note: Used float vs. float32_t for simplicity.
Note: Maybe use fabsf() rather than fabs().
Minor: Suggest denom (denominator) in lieu of denum.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • Wikipedia should have a “list of C functions that distinguish +0.0 and -0.0”. – Pascal Cuoq Aug 15 '14 at 16:25
  • 1
    Thank you for this elaborate answer. The atan2-trick is clever. Some remarks from me: --- I think the math.h xxxf functions like fabsf are first introduced with C99. I'm am sure about atan2f regarding this. Same holds for copysign and signbit. --- For the union solution, I will go for this for final optimizations, when for the final compiler settings I can get the assured the bit pattern by our compiler ventor. --- For our application we have strict runtime guarantees to give so we are in an "average case == worst case" design, whereever the code becomes simpler. So we have no "happy path". – Mark A. Aug 19 '14 at 06:06
  • @Mark A. For alternate ways to distinguish +0.0 from -0.0, see http://stackoverflow.com/questions/25332133/what-operations-and-functions-on-0-0-and-0-0-give-different-arithmetic-results. Add solutions there if you find another. – chux - Reinstate Monica Aug 19 '14 at 11:21
0

To avoid the corner cases with rounding and what not, you could massage the exponent on the divisor -- with frexp() and ldexp() -- and worry about whether the result can be scaled back without overflow. Or frexp() both arguments, and do the exponenent work by hand.