
I'm writing code where measurements, taken every half second, need to be subtracted from an initial value until it eventually reaches 0. Both values are floats. The initial value is 140,000,000 and the measurements range from 0.320000001 to 0.389999999.

    float batt = 140000000.00; // capacity 140M units
    float subtr;

    /* ... */
    while (1) {
        batt = float(batt - subtr);
        /* ... */
    }
    

So basically I need it to subtract 0.3xxxxxxxx from 140,000,000.00 on every cycle of the loop, but there seems to be a precision problem: when I debug it, I still get 140M every time.

I tried with a 1000x smaller batt value, 148,000, and converted the measurements from 0.3xxxxxxxx to 0.0003xxxxxxxx. When debugging the code, 148000 - 0.000300005049 (a measurement value) gives me 147999.469, which is 0.530699 off from the expected result (147,999.999,699).

It seems that float is not accurate enough for my needs. Should I convert my values to some other type, or is there another way I could get accurate results? I was thinking of converting the measurements to values without decimals, but that wouldn't work either, because the initial value would get far too big for float (148*10^15). When using 140,000,000.00 I expect accuracy of three decimal places (.xxx), and when using 140,000.00, accuracy of six decimal places (.xxx,xxx) accordingly.

Inzzza8
  • Try `double`? `long double`? The only point of removing "decimals" would be to use an integer type, say `int64_t`. – Marc Glisse Aug 31 '18 at 16:52
  • 3
    Instead of subtracting really small values from a really large value (which is, indeed, inaccurate), add up the really small values as they come along. When the accumulated value exceeds your start value your'e done. – Pete Becker Aug 31 '18 at 16:54
  • 3
    @PeteBecker: If the values are small enough, and the target is large enough, that will also be inaccurate; eventually, the running total will reach a magnitude where adding a sufficiently small value is equivalent to adding zero. – ShadowRanger Aug 31 '18 at 16:55
  • @ShadowRanger -- good point. Similar problem, postponed. – Pete Becker Aug 31 '18 at 16:56
  • [Is floating point math broken](https://stackoverflow.com/questions/588004/is-floating-point-math-broken#588014) – Jesper Juhl Aug 31 '18 at 16:59
  • 3
    If `double`/`long double` don't solve the problem, the best solution I can think of (besides "Use MPFR" or other third party libraries) is use @PeteBecker's solution, but whenever the `float` accumulator exceeds `1.0`, remove the integer component and put it in a `uint64_t`, leaving only the fractional component. Or equivalently, when the accumulator reaches some sufficiently large number, subtract it from the target value and reset it 0. *Big* risk of repeated rounding error there though. – ShadowRanger Aug 31 '18 at 17:00

1 Answer


When you do 140000000 - 0.389 the second operand needs to be scaled to have the same exponent as the first one: 1.4e8 - 0.00000000389e8 = 1.39999999611e8. Intel CPUs, when using the x87 FPU, do floating point calculations in an extended-precision 80-bit format, but when the result is stored back into a 32-bit float, 1.39999999611e8 gets rounded back to 1.4e8 because float has roughly 6 decimal digits of precision.

Storing the decimal number 148000000.0003xxxxxxxx requires roughly 24 decimal digits of precision, or 80 binary digits. An 80-bit long double may just do:

#include <cstdio>

int main() {
    float a = 140000000.f;
    float b = 0.389999999f;
    printf("%f\n", a);
    printf("%f\n", b);
    printf("float result:       %.16f\n", a - b); // Round the 80-bit extended precision result to 32-bit.
    printf("double result:      %.16f\n", static_cast<double>(a - b)); // Round the 80-bit extended precision result to 64-bit.
    printf("long double result: %.16Lf\n", static_cast<long double>(a) - b); // 80-bit extended precision result.
}

Outputs:

140000000.000000
0.390000
float result:       140000000.0000000000000000
double result:      140000000.0000000000000000
long double result: 139999999.6100000143051147
Maxim Egorushkin
  • This is not an accurate description of floating-point arithmetic. Per IEEE 754, the result is computed as if infinitely precise arithmetic is used, and then the operand is rounded to the nearest representable value (in a direction determined by the rounding rule in effect). The significand (the preferred term for the fraction part of a floating-point number; “mantissa” is the fraction part of a logarithm) is not rounded or truncated prior to arithmetic. – Eric Postpischil Sep 02 '18 at 18:10
  • @EricPostpischil You are quite right, updated my answer. – Maxim Egorushkin Sep 03 '18 at 10:17