4

I have come across some behaviour with the float type in C that I do not understand, and was hoping it might be explained. Using the macros defined in float.h I can determine the maximum/minimum values that the datatype can store on the given hardware. However, when performing a calculation that should not exceed these limits, I find that a variable typed float fails where a double succeeds. The following is a minimal example, which compiles on my machine.

#include <stdio.h>
#include <stdlib.h>
#include <float.h>

int main(int argc, char **argv)
{
    int gridsize;
    long gridsize3;

    float *datagrid;

    float sumval_f;
    double sumval_d;

    long i;

    gridsize = 512;
    gridsize3 = (long)gridsize*gridsize*gridsize;

    datagrid = calloc(gridsize3, sizeof(float));
    if(datagrid == NULL)
    {
        printf("Memory allocation failed\n");
        exit(1);
    }

    for(i=0; i<gridsize3; i++)
    {
        datagrid[i] += 1.0;
    }

    sumval_f = 0.0;
    sumval_d = 0.0;
    for(i=0; i<gridsize3; i++)
    {
        sumval_f += datagrid[i];
        sumval_d += (double)datagrid[i];
    }

    printf("\ngridsize3 = %e\n", (float)gridsize3);
    printf("FLT_MIN = %e\n", FLT_MIN);
    printf("FLT_MAX = %e\n", FLT_MAX);
    printf("DBL_MIN = %e\n", DBL_MIN);
    printf("DBL_MAX = %e\n", DBL_MAX);

    printf("\nfloat sum = %f\n", sumval_f);
    printf("double sum = %lf\n", sumval_d);
    printf("sumval_d/sumval_f = %f\n\n", sumval_d/(double)sumval_f);

    free(datagrid);
    return(0);
}

Compiling with gcc I find the output:

gridsize3 = 1.342177e+08
FLT_MIN = 1.175494e-38
FLT_MAX = 3.402823e+38
DBL_MIN = 2.225074e-308
DBL_MAX = 1.797693e+308

float sum = 16777216.000000
double sum = 134217728.000000
sumval_d/sumval_f = 8.000000

Compiling with icc instead gives sumval_f = 67108864.0, and hence a final ratio of 2.0.* Note that in both cases the float sum is incorrect, whilst the double sum is correct.

As far as I can tell, the output of FLT_MAX suggests that the sum should fit into a float, and yet it seems to plateau at either an eighth or a half of the correct value.

Is there a compiler specific override to the values found using float.h? Why is a double required to correctly find the sum of this array?

*Interestingly, including an if statement inside the for loop that prints values of the array causes the icc result to match the gcc output, i.e. an eighth of the correct sum rather than a half.

  • 3
    Here's an even more minimal example: https://ideone.com/EcSqxR . Floating point numbers have limited precision; at some point, 1.0 becomes too small to be added to a float. – Bob__ Oct 12 '17 at 20:58
  • Bob__'s example says adding `1` to float `16777216` makes no difference. It's not just a matter of range, but of significance. Please [read this](https://stackoverflow.com/questions/2100490/floating-point-inaccuracy-examples). – Weather Vane Oct 12 '17 at 21:02
  • 1
    Reopened. The "canonical" duplicate is more to do with the mathematics of floating point numbers; this one is more to do with their dispersion. Is there a better duplicate? – Bathsheba Oct 12 '17 at 21:10

3 Answers

5

The problem here isn't the range of values but the precision.

Assuming a 32-bit IEEE754 float, this datatype has a maximum of 24 bits of precision. This means that not all integers larger than 16777216 can be represented exactly.

So when your sum reaches 16777216, adding 1 to it is outside the precision of what the datatype can store, so the number doesn't get any bigger.

A (presumably) 64-bit double has 53 bits of precision. This is enough bits to hold all integer values up to your sum of 134217728, so it gives you an accurate result.
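
A minimal sketch of that plateau (assuming IEEE754 single precision; the names here are illustrative, not taken from the question's code):

#include <stdio.h>

int main(void)
{
    /* At 2^24 the gap between adjacent floats is 2.0, so adding 1.0f
       produces a result that rounds (ties-to-even) back to the same value. */
    float f = 16777216.0f;      /* 2^24 */
    float f_plus = f + 1.0f;    /* still 16777216.0f */

    double d = 16777216.0;
    double d_plus = d + 1.0;    /* 16777217.0, exactly representable */

    printf("float:  %.1f + 1 = %.1f\n", f, f_plus);
    printf("double: %.1f + 1 = %.1f\n", d, d_plus);
    return 0;
}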

– dbush (edited by Bathsheba)
  • I am surprised that you answer this nth-time dupe. – 0___________ Oct 12 '17 at 21:05
  • Why is the behaviour different with the `icc` compiler, i.e. able to reach a larger value? –  Oct 12 '17 at 21:18
  • 2
    @lewis C is allowed to perform intermediate calculations at a higher precision. This is implementation-defined. Research `FLT_EVAL_METHOD` (see the sketch after these comments). – chux - Reinstate Monica Oct 12 '17 at 21:21
  • BTW I suspect adding 1 (or small n) to 16777216 may get a bigger sum due to round-to-nearest, yet eventually adding 1 will not round up. – chux - Reinstate Monica Oct 12 '17 at 21:24
  • @Lewis the standard allows for `sumval_f + datagrid[i]` to be computed in whatever precision register the compiler has available; although the result must then be downgraded to `float` after each addition for storing back into `sumval_f`. Possibly icc uses a different rounding method for this downgrade (or perhaps it doesn't follow the standard) – M.M Oct 12 '17 at 22:50
  • @chux datagrid[i] contains 1.0f for every i, despite the += which is misleading. 2^24 obviously has an even significand, and 2^24+1 will round to nearest, tie to even by default, so it won't get bigger – aka.nice Oct 13 '17 at 00:15
  • @aka.nice Yes ties to even would not extend the "range" I got my tie [wrong](https://stackoverflow.com/questions/46718656/understanding-the-maximum-values-that-can-be-stored-in-floats-in-c/46718827?noredirect=1#comment80384744_46718827). – chux - Reinstate Monica Oct 13 '17 at 01:49
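
To see which evaluation method a given compiler reports, here is a small sketch (FLT_EVAL_METHOD is standard since C99; this is not a definitive diagnosis of icc's behaviour, and, as M.M notes above, whether extra precision survives each assignment is a separate question):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Implementation-defined: 0 = evaluate each operation in its own type,
       1 = evaluate float operations in double,
       2 = evaluate in long double (classic x87 behaviour). */
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);

    /* If an optimiser keeps this running sum in a wider register instead
       of rounding it back to float on every assignment, the sum can climb
       past 2^24 before the final store; that is one possible mechanism
       behind the larger icc result. */
    float sum = 0.0f;
    long i;
    for (i = 0; i < 134217728L; i++)
        sum += 1.0f;
    printf("float sum = %f\n", sum);
    return 0;
}
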
2

A float can precisely represent any integer between -16777215 and +16777215, inclusive. It can also represent all even integers between -2*16777215 and +2*16777215 (including +/- 2*8388608, i.e. 16777216), all multiples of 4 between -4*16777215 and +4*16777215, and likewise for all power-of-two scaling factors up to 2^104 (roughly 2.028E+31). Additionally, it can represent multiples of 1/2 from -16777215/2 to +16777215/2, multiples of 1/4 from -16777215/4 to +16777215/4, and so on down to multiples of 1/2^149 from -16777215/(2^149) to +16777215/(2^149).
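
A quick way to see those power-of-two gaps directly is nextafterf (a minimal sketch; link with -lm on most systems):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* The spacing between adjacent floats doubles at each power of two:
       1.0 just below 2^24, 2.0 from 2^24 to 2^25, and so on. */
    float xs[] = { 16777215.0f, 16777216.0f, 33554432.0f, 134217728.0f };
    int i;

    for (i = 0; i < 4; i++)
    {
        float next = nextafterf(xs[i], INFINITY);
        printf("after %.1f the next float is %.1f (gap %.1f)\n",
               xs[i], next, next - xs[i]);
    }
    return 0;
}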

– supercat
  • Can't it precisely represent 16777216 (2^24)? Yes it can, even if the least significant bit corresponds to 2^1. – aka.nice Oct 13 '17 at 00:02
  • If you are talking about IEEE754 32-bit floats, they can precisely represent every *consecutive* integer up to 16777216, due to their 24-bit significand (one bit of which is an implicit 1). The next representable number is 16777218. – Bob__ Oct 13 '17 at 00:21
  • 1
    @Bob__ supercat implies the "integer up to 16777216" with "all even integers between -2*16777215 and +2*16777215". The "between -16777215 and +16777215" style has an advantage in that `FLT_MAX` is a `+(2^104)*16777215`. – chux - Reinstate Monica Oct 13 '17 at 01:55
  • @chux Oh, yes, now I see. Thanks. So, there's a typo for the multiples of 4 ;) – Bob__ Oct 13 '17 at 06:24
  • @chux A case of blindness from my side too. But since 2^24 is implicitly cited several times through 2*2^23, 4*2^22, etc., it does not hurt to cite it once more and be more explicit about the first interval. Anyway, while the technique scales well for integers up to FLT_MAX, it doesn't for negative exponents, since some fractions with a numerator near the middle of [-16777215,+16777215] will start underflowing sooner than the edges and won't be representable. – aka.nice Oct 13 '17 at 07:17
  • 1
    @aka.nice It appears to me to work nicely for negative exponents too. With numbers near the "middle", the value will become sub-normal and not underflow. Perhaps you can provide a counter-example? – chux - Reinstate Monica Oct 13 '17 at 11:02
  • @chux My mistake: if we go down to exponent -149 (emin - precision + 1), then we cover all finite float values and all fractions are representable, so yes, we have a bijection between the set of finite floats and the set of fractions generated by this algorithm. – aka.nice Oct 13 '17 at 12:50
  • I'd started using 16777216, but then realized that doesn't work at the top even though 16777215 works everywhere. I failed to edit them all to match. – supercat Oct 13 '17 at 12:50
  • @chux: Thanks for the last edit. Seems I'd still missed one. I think the case of 16777216 is a little awkward, since it's available for all exponents but one. It isn't "needed" for any of the exponents where it's available, but except at the very top it would be convenient to include it. – supercat Oct 13 '17 at 14:32
0

Floating point numbers represent all of the infinite possible values between any two numbers; but computers cannot hold an infinite number of values. So a compromise is made. The floating point numbers hold an approximation of the value.

This means that if you pick a value that is "more" than the stored floating point number, but not enough to arrive at the "next" storable approximation, then storing that logically bigger number won't actually change the floating point value.

The "error" in a floating point approximation is variable. For small numbers, the error is more precise; for bigger numbers, the error proportionally the same, but a bigger actual value.

– Edwin Buck
  • 1
    Agree with your overall intent, yet "Floating point numbers represent all of the infinite possible values" needs clarity --> FP numbers represent a limited sub-set of possible values. " ... hold an approximation of the value.". True, yet finite FP values are themselves exact. – chux - Reinstate Monica Oct 12 '17 at 21:19
  • 1
    @chux There is a difference between the concept of floating point numbers and what they really are in a discrete-logic world. The first time I mentioned them, I was talking about what they represent, which is "any" number. The second time, I talked about how they are implemented, which is the "computers cannot hold an infinite representation" part. A person well versed in floating point numbers on computers will probably replace the representation with the implementation; but it is obvious that this poster is not that person :) Thanks for the comment! – Edwin Buck Oct 13 '17 at 15:48