
I was implementing mean and stddev calculation for an array of unsigned char (say, a grayscale image). To store the sum and the average I tried both float and double, and got different results. I have reduced the issue to this minimal reproducing example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(){
    #define BASE 16777306

#if 0
    float a = BASE;
    unsigned char b = 179;
    float c = a + b; // float sum
    printf("float  type, a=%f, (float)b=%f, c=%f, a+b=%d\n", a, (float)b, c, BASE+b);
#else
    double a = BASE;
    unsigned char b = 179;
    double c = a + b; // double sum
    printf("double type, a=%lf, (double)b=%lf, c=%lf, a+b=%d\n", a, (double)b, c, BASE+b);
#endif

    return 0;
}

Result:

double type, a=16777306.000000, (double)b=179.000000, c=16777485.000000, a+b=16777485
float  type, a=16777306.000000, (float)b=179.000000, c=16777484.000000, a+b=16777485

As shown above, the value 16777484.000000 produced with the float type is 1 less than 16777485.000000, the correct value.

Why does using float for the sum lead to a wrong result, while double stays consistent with the int result?
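
A side check, separate from the program above: 16777485 exceeds 2^24 = 16777216, so it has no exact float representation. The nearest candidates are 16777484 and 16777486, and round-to-nearest-even picks 16777484.

#include <stdio.h>

int main(void) {
    // Above 2^24, consecutive floats are 2 apart, so only even integers
    // are representable. 16777485 lies exactly halfway between 16777484
    // and 16777486; round-to-nearest-even selects 16777484.
    float f = 16777485.0f;
    double d = 16777485.0;  // double has a 53-bit significand: exact
    printf("float : %.1f\n", f);  // prints 16777484.0
    printf("double: %.1f\n", d);  // prints 16777485.0
    return 0;
}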

ChrisZZ
  • `float` has only 24 bits of significand precision, so it can represent integer values exactly only up to 16,777,216. To make matters worse, when you add values of vastly different magnitude, you can lose precision too. What is the purpose of using `float` here when you only want integer arithmetic? – paddy Mar 19 '21 at 06:26
  • The rule of thumb is never use floating point in integer calculations where you expect exact results. – dxiv Mar 19 '21 at 06:29
  • @paddy I was developing image processing functions for an ARM platform, and was told "don't use double since double may be slow, blablabla". – ChrisZZ Mar 19 '21 at 06:31
  • FYI %f and %lf are identical format specifiers. – 273K Mar 19 '21 at 06:32
  • "blablabla" is based in good reasoning and you should listen to it but more importantly try to understand it and not just pass it off as people making noise. But if you _need_ double precision then you should _use_ it. I don't see any need here, because you are just doing integer arithmetic. – paddy Mar 19 '21 at 06:33
  • @paddy For this simple snippet, an integer is enough. But for a 4000x4000 grayscale image (unsigned char), accumulating every pixel value into an int32_t may overflow, and for images of about 4104x4104 and larger, a uint32_t may also overflow. I used float to avoid overflow in the worst case (all pixels equal to 255). – ChrisZZ Mar 19 '21 at 06:38
  • If you want exact integer values, a `float` is limited to 16777216. Above that value, multiple `int` values map to the exact same `float` value. See [this answer for an example](https://stackoverflow.com/questions/23420783/convert-int-max-to-float-and-then-back-to-integer/23423240#23423240). – user3386109 Mar 19 '21 at 06:38
  • Why not use `int64_t`? – paddy Mar 19 '21 at 06:38
  • @paddy Good! I just forgot about `int64_t`. – ChrisZZ Mar 19 '21 at 06:40
  • If I remember correctly, a good rule of thumb is that float has about 6 significant decimal digits of precision. If you need more, use integer types or double, depending on the use case. But floating point has some inherent imprecision that integers do not have. – Kami Kaze Mar 19 '21 at 08:23
  • `float`, as a 32-bit object, can represent at most about 2^32 different values, and 16777485.0 is not one of them; the nearest alternative is 16777484.0. `double`, as a 64-bit object, can represent about 2^64 different values, and 16777485.0 is one of them. The values a `float`/`double` can store are distributed logarithmically, not linearly: the "decimal" point "floats" rather than being fixed, as it is for integers. – chux - Reinstate Monica Mar 19 '21 at 10:11
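
A small probe of the point paddy, user3386109, and chux are making (a sketch, not code from the thread): the gap between consecutive float values grows with magnitude, and it reaches 2 at 2^24 = 16777216, which is exactly where integers stop being exactly representable.

#include <stdio.h>
#include <math.h>

int main(void) {
    // Distance from each value to the next representable float:
    // the gap doubles every time the magnitude crosses a power of two.
    float vals[] = { 1.0f, 1024.0f, 16777216.0f /* 2^24 */, 33554432.0f /* 2^25 */ };
    for (int i = 0; i < 4; ++i) {
        float gap = nextafterf(vals[i], INFINITY) - vals[i];
        printf("next float after %.1f is %.10g away\n", vals[i], gap);
    }
    return 0;
}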
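
Following paddy's `int64_t` suggestion, one possible shape for the original mean/stddev computation (a sketch with made-up names, not code from the thread): accumulate in 64-bit integers, which cannot overflow for any realistic image size, and switch to double only for the final division.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <math.h>

// Illustrative helper (hypothetical name). Exact integer accumulation,
// floating point only at the end. Worst case for W x H pixels of value 255
// is W*H*255, which fits easily in int64_t. Assumes n > 0.
static void mean_stddev_u8(const unsigned char *px, size_t n,
                           double *mean, double *stddev)
{
    int64_t sum = 0, sum_sq = 0;
    for (size_t i = 0; i < n; ++i) {
        sum    += px[i];
        sum_sq += (int64_t)px[i] * px[i];
    }
    *mean = (double)sum / (double)n;
    // variance = E[x^2] - (E[x])^2
    double var = (double)sum_sq / (double)n - (*mean) * (*mean);
    *stddev = var > 0.0 ? sqrt(var) : 0.0;
}

int main(void) {
    unsigned char img[] = { 10, 200, 255, 0, 37, 179 };
    double m, s;
    mean_stddev_u8(img, sizeof img / sizeof img[0], &m, &s);
    printf("mean = %f, stddev = %f\n", m, s);
    return 0;
}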

0 Answers