16

A float (a.k.a. single) value is a 4-byte value, and supposed to represent any real-valued number. Because of the way it is formatted and the finite number of bytes it is made off, there is a minimum value and a maximum value it can represent, and it has a finite precision, depending on it's own value.

I would like to know if there is a way to get the closest possible value above or below some reference value, given the finite precision of a float. With integers, this is trivial: one simply adds or subtracts 1. But with a float, you can't simply add or subtract the minimum float value and expect it to be different from your original value. I.e.

float FindNearestSmaller (const float a)
{
    return a - FLT_MIN; /* This doesn't necessarily work */
}

In fact, the above will almost never work. In the above case, the return will generally still equal a, as the FLT_MIN is far beyond the precision of a. You can easily try this out for yourself: it works for e.g. 0.0f, or for very small numbers of order FLT_MIN, but not for anything between 0 and 100.

So how would you get the value that is closest but smaller or larger than a, given floating point precision?

Note: Though i am mainly interested in a C/C++ answer, I assume the answer will be applicable for most programming languages.

Yellow
  • 3,955
  • 6
  • 45
  • 74
  • Really curious to see an answer to this. – Almo Sep 15 '14 at 13:58
  • `memcpy` into an integer, add one, `memcpy` back into `float` ? – Bart Friederichs Sep 15 '14 at 13:58
  • @BartFriederichs: no, that won't work in all cases. – Paul R Sep 15 '14 at 14:00
  • Also, http://stackoverflow.com/a/7408487 – Hasturkun Sep 15 '14 at 14:06
  • with float one simply adds one or removes one to the mantissa (and normalizes) to find the next nearest – old_timer Sep 15 '14 at 14:52
  • There are already [nextafter](http://en.cppreference.com/w/cpp/numeric/math/nextafter) for you to use http://stackoverflow.com/questions/16322465/finding-the-closest-floating-point-value-less-than-a-specific-integer-value-in-c http://stackoverflow.com/questions/11159774/c-next-float-with-numeric-limits-epsilon http://stackoverflow.com/questions/10160079/how-to-find-nearest-next-previous-double-value-numeric-limitsepsilon-for-give – phuclv Sep 15 '14 at 15:24

3 Answers3

14

The standard way to find a floating-point value's neighbors is the function nextafter for double and nextafterf for float. The second argument gives the direction. Remember that infinities are legal values in IEEE 754 floating-point, so you can very well call nextafter(x, +1.0/0.0) to get the value immediately above x, and this will work even for DBL_MAX (whereas if you wrote nextafter(x, DBL_MAX), it would return DBL_MAX when applied for x == DBL_MAX).

Two non-standard ways that are sometimes useful are:

  1. access the representation of the float/double as an unsigned integer of the same size, and increment or decrement this integer. The floating-point format was carefully designed so that for positive floats, and respectively for negative floats, the bits of the representation, seen as an integer, evolve monotonously with the represented float.

  2. change the rounding mode to upward, and add the smallest positive floating-point number. The smallest positive floating-point number is also the smallest increment that there can be between two floats, so this will never skip any float. The smallest positive floating-point number is FLT_MIN * FLT_EPSILON.


For the sake of completeness, I will add that even without changing the rounding mode from its “to nearest” default, multiplying a float by (1.0f + FLT_EPSILON) produces a number that is either the immediate neighbor away from zero, or the neighbor after that. It is probably the cheapest if you already know the sign of the float you wish to increase/decrease and you don't mind that it sometimes does not produce the immediate neighbor. Functions nextafter and nextafterf are specified in such a way that a correct implementation on the x86 must test for a number of special values and FPU states, and is thus rather costly for what it does.

To go towards zero, multiply by 1.0f - FLT_EPSILON.

This doesn't work for 0.0f, obviously, and generally for the smaller denormalized numbers.

The values for which multiplying by 1.0f + FLT_EPSILON advance by 2 ULPS are just below a power of two, specifically in the interval [0.75 * 2p … 2p). If you don't mind doing a multiplication and an addition, x + (x * (FLT_EPSILON * 0.74)) should work for all normal numbers (but still not for zero nor for all the small denormal numbers).

Pascal Cuoq
  • 79,187
  • 7
  • 161
  • 281
11

Look at the "nextafter" function, which is part of Standard C (and probably C++, but I didn't check).

gnasher729
  • 51,477
  • 5
  • 75
  • 98
0

I tried it out on my machine. And all three approaches:
1. adding with 1 and memcopying
2. adding FLT_EPSILON
3. multiplying by (1.0f + FLT_EPSILON)
seems to give the same answer.


see the result here
bash-3.2$ cc float_test.c -o float_test; ./float_test 1.023456 10
Original num: 1.023456
int added = 1.023456 01-eps added = 1.023456 mult by 01*(eps+1) = 1.023456
int added = 1.023456 02-eps added = 1.023456 mult by 02*(eps+1) = 1.023456
int added = 1.023456 03-eps added = 1.023456 mult by 03*(eps+1) = 1.023456
int added = 1.023456 04-eps added = 1.023456 mult by 04*(eps+1) = 1.023456
int added = 1.023457 05-eps added = 1.023457 mult by 05*(eps+1) = 1.023457
int added = 1.023457 06-eps added = 1.023457 mult by 06*(eps+1) = 1.023457
int added = 1.023457 07-eps added = 1.023457 mult by 07*(eps+1) = 1.023457
int added = 1.023457 08-eps added = 1.023457 mult by 08*(eps+1) = 1.023457
int added = 1.023457 09-eps added = 1.023457 mult by 09*(eps+1) = 1.023457
int added = 1.023457 10-eps added = 1.023457 mult by 10*(eps+1) = 1.023457

Code
#include <float.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

int main(int argc, char *argv[])
{

    if(argc != 3) {
        printf("Usage: <binary> <floating_pt_num> <num_iter>\n");
        exit(0);
    }

    float f = atof(argv[1]);
    int count = atoi(argv[2]);

    assert(count > 0);

    int i;
    int num;
    float num_float;

    printf("Original num: %f\n", f);
    for(i=1; i<=count; i++) {
        memcpy(&num, &f, 4);
        num += i;
        memcpy(&num_float, &num, 4);
        printf("int added = %f \t%02d-eps added = %f \tmult by %2d*(eps+1) = %f\n", num_float, i, f + i*FLT_EPSILON, i, f*(1.0f + i*FLT_EPSILON));
    }

    return 0;
}
Nikhil Vidhani
  • 709
  • 5
  • 11
  • 1
    `%.9f` is better than `%f` to show what is happening, and `%a` is even better. – Pascal Cuoq Sep 15 '14 at 17:58
  • Suggest testing against numbers on just either side of a power-of-2 (and away from 1.0, maybe `-`) like -1023.9 and -1024.1. Confident significant differences will be shown. – chux - Reinstate Monica Sep 15 '14 at 18:14
  • @chux well that is expected. With larger numbers multiplying by `1+FLT_EPSILON` will give a larger drift. Also the add-by-1 integer method will also affect the mantissa by a larger amount; in some cases like `2^30` changing the exponent as well. But adding by `FLT_EPSILON` still produces consistent results. Actually now the changes are even slower because we are changing bigger numbers; so effect of small `FLT_EPSILON` is shadowed. – Nikhil Vidhani Sep 15 '14 at 18:33
  • 1
    `f + i*FLT_EPSILON` will clearly not work properly for values of `f` larger than 4. No one said that `f*(1.0f + i*FLT_EPSILON)` worked for `i` larger than 1. Interestingly, the `num += i;` method does work for arbitrary values of `i` as long as the float does not overflow. – Pascal Cuoq Sep 15 '14 at 18:34
  • @PascalCuoq Actually, `num += i` won't give you the correct answer when the test number gets bigger as I already explained that adding `1` will have a magnified effect on the test number. For instance, with `256` as test number, I get `256.000031` as the next number which can't be correct as there are at least `30` other visible floats in between. – Nikhil Vidhani Sep 15 '14 at 18:46
  • 1
    The next larger `float` after `256.0f` is `256.0000305...f`. [Ref: binary32](http://en.wikipedia.org/wiki/Single-precision_floating-point_format) Adding 1 _should_ have a magnified effect as `f` is larger. Disagree that there are at least 30 visible floats in between. – chux - Reinstate Monica Sep 15 '14 at 18:56
  • 1
    "Adding 1" will work for typical `float` ([binary32](http://en.wikipedia.org/wiki/Single-precision_floating-point_format)) from `+0.0f` to `FLT_MAX`. – chux - Reinstate Monica Sep 15 '14 at 19:03
  • 1
    What makes you think that there are “at least 30 other visible floats” between 256 and 256.000031? What are the bit representations of these floats? What do they print when passed to `printf("%.9e\n", (float) X)`? – Pascal Cuoq Sep 15 '14 at 19:06
  • @chux corrected the priting issue. though i the test results were from updated code; so they were correct. BTW can't I declare `float f = 256.00001` ? – Nikhil Vidhani Sep 15 '14 at 19:07
  • @chux yes you are right... printing `256.000001` prints `256.00000` and printing `256.00002` prints `256.000031`. Thanks for clarifying to me. – Nikhil Vidhani Sep 15 '14 at 19:13
  • 2
    @Nikhil Vidhani Note: Using to `printf("%f", f)` to print the next floating point number is problematic for large and small `float` due to the non-exponential notation and limited number of decimal places. To get a clearer idea of the nature of the next `float`, use `printf("%.*e", FLT_DECIMAL_DIG - 1, f)`. This will print `FLT_DECIMAL_DIG` digits, guaranteed by the C spec just precise enough to show the difference between any two adjacent `float`s. – chux - Reinstate Monica Sep 15 '14 at 19:22