Round Down a Floating Point Result

Question

I have two floating point (double) values a and b, and I'm looking to add them to get a result c.

I know that c will somehow be approximated, because everything is finite precision. Now, I want to 'round down' c, meaning that the floating point c is not greater than the real sum of floating points a and b, or c <= a + b.

How can I do this? The following code in c comes to mind, but I'm not sure whether the answer will be what I want.

c = nextafter(a + b, bigNegativeNumber)

Same question goes for multiplication instead of addition. :)

PS. If it helps, a and b are always non-negative numbers.

Edit: c should be a floating point as well

At least on some architectures you may be able to set the rounding mode to round towards zero or negative infinity: `IEEE 754 1985: 4.2. Directed Roundings An implementation shall also provide three user-selectable directed rounding modes: round toward +INFINITY, round toward – INFINITY, and round toward 0.` — EOF, Jul 15 '15 at 23:39
Same as Weather Vane, but the longer route, `c = floor(-1 * (a + b))*-1`; if there was some issue with decimal rounding. — Evan Carslake, Jul 15 '15 at 23:40
@user3528438 it is floating point type, not integer. If it were integer there would be no question to begin with. — Evan Carslake, Jul 15 '15 at 23:43
@WeatherVane I edited the post, I'm interested in finding a floating point value, not an integer one. Sorry for being vague. — drasko, Jul 15 '15 at 23:45
If I previously misunderstood your question, I think you will find that any lack in precision will result in `c` being always `<= a + b` when `a` and `b` are positive. — Weather Vane, Jul 15 '15 at 23:46
@WeatherVane: ...unless the floating-point implementation is in round-to-nearest- or round-towards-(positive-)infinity-mode? — EOF, Jul 15 '15 at 23:48
@WeatherVane ah I misread, now looking, also wondering how two positives could be negative, if added. — Evan Carslake, Jul 15 '15 at 23:48
@EOF: Looks like what I'm looking for. I found a how-to at [link](http://stackoverflow.com/questions/6867693/change-floating-point-rounding-mode) Thanks! — drasko, Jul 15 '15 at 23:50
@EOF thanks yes, just been looking at MSVC `_control87` which can specify a flag `_MCW_RC` for Rounding control. https://msdn.microsoft.com/en-us/library/e9b52ceh.aspx — Weather Vane, Jul 15 '15 at 23:55

njuffa · Accepted Answer · 2015-07-16T00:02:22.160

Based on your description, it seems like you want to control the rounding mode for a floating-point operation. This is supported in C99 by the functionality provided in the header file fenv.h. You may need to instruct your compiler to turn on C99 support, and you may need to instruct it to perform floating-point arithmetic in an IEEE-754 compliant way. Below is a minimal example showing how to perform double addition with truncation (rounding towards zero). Since your operands are known to be positive, this is equivalent to rounding down (towards negative infinity).

#include <stdio.h>
#include <stdlib.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

double dadd_rz (double a, double b) 
{
    double res;
    int orig_mode = fegetround ();
    fesetround (FE_TOWARDZERO);  // set rounding mode to truncate
    res = a + b;
    fesetround (orig_mode);      // restore rounding mode
    return res;
}

int main (void)
{
    double a = 0x1.fffffffffffffp1023;
    printf ("                  a = %20.13a\n", a);
    printf ("                a+a = %20.13a\n", a + a);
    printf ("round_to_zero (a+a) = %20.13a", dadd_rz (a, a));
    return EXIT_SUCCESS;
}

The output of the above program should look something like this (note that the printing of infinities is implementation dependent):

                  a = 0x1.fffffffffffffp+1023
                a+a = 0x1.#INF000000000p+0
round_to_zero (a+a) = 0x1.fffffffffffffp+1023

score 0 · Answer 2 · edited May 23 '17 at 12:21

A tricky problem.

@EOF's comment above to "round toward 0" is good and will provide an optimal result.

#ifdef _ _STDC_IEC_559_ _ 
    fesetround(FE_DOWNWARD);
    c = a + b;
#else
   #error unable to set rounding mode
#endif

OP's original approach is close too. Any good compilation/processor should be expected to create the best answer to with in 0.5 or 1.0 ULP (depending on rounding mode). It will certainly create a sum c2 less than arithmetic a+b, but c may meet the requirements too.

c = a + b
c2 = nextafter(c, -DBL_MAX);

c = floor(a + b) will not work as a could be far larger in magnitude than some small negative b such that the computed sum is simple still a and fails the arithmetic c <= a + b.

Round Down a Floating Point Result

2 Answers2