
I'm not even sure what I'm asking for is possible, but here goes:

Our C++ code performs the following calculation:

double get_delta(double lhs, double rhs) {
  return lhs - rhs;
}

and with the inputs (the nearest doubles to) 655.36 and 655.34 this does not produce the "nearest" double to exactly 0.02, but a value slightly below it, 0.0199999999999818...
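(For illustration, a minimal self-contained sketch of the effect -- it just prints the raw difference at 20 significant digits; the commented output is what I'd expect on any IEEE-754 binary64 platform:)

#include <cstdio>

double get_delta(double lhs, double rhs) {
  return lhs - rhs;
}

int main() {
  // The literals are silently converted to the nearest representable doubles.
  double const delta = get_delta(655.36, 655.34);
  // Prints 1.9999999999981810106e-02 rather than 2.0000000000000000416e-02
  // (the latter being the double nearest to 0.02).
  std::printf("%.19e\n", delta);
}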

Of course, one cannot expect precise results from IEEE double, but I'm wondering if the naive delta calculation could be improved to get nearer to what we want:

Given that the two input values can be represented pretty precisely (15 digits, see below), it is unfortunate, but expected, that the difference does not have the same precision (vs. the ideal result):


As can be seen from the following values, subtracting the two values from each other yields a delta value with fewer significant digits than the two starting values.

While the example values are precise up to 16 decimal digits (even though DBL_DIG is only 15), the resulting delta value is only precise up to 12 decimal digits: from its 13th significant digit onward it is noise, whereas the original values only turn into noise after their 16th digit.

So in this case, rounding the delta value to 12 significant decimal digits would again yield the "correct" result, in that it would give me 0.02.
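(As a sketch of what I mean by that rounding step -- the helper name and the detour through a decimal string are just for illustration:)

#include <cstdio>
#include <cstdlib>

// Round d to 'digits' significant decimal digits by a round trip through
// a decimal string. Only a sketch of the idea, not production code.
double round_to_significant(double d, int digits) {
  char buf[64];
  std::snprintf(buf, sizeof buf, "%.*e", digits - 1, d);
  return std::strtod(buf, nullptr);
}

// round_to_significant(get_delta(655.36, 655.34), 12) yields the double
// nearest to 0.02, i.e. v3 below.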

So, possibly the question then becomes:

Given a subtraction a - b whose two operands are both assumed to carry full double precision, i.e. 15 significant decimal digits, how do you calculate the precision of the resulting difference?
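(A rough estimate, if my reasoning is right: each operand is off from its intended decimal value by at most half an ulp, i.e. at most about 655.36 * 2^-53 ≈ 7e-14 here, so the raw difference can be off by roughly 1.5e-13 in absolute terms; relative to a difference of 0.02 that is about 7e-12, leaving only some 11-12 trustworthy decimal digits -- which matches the values shown below.)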


Specifically:

655.36 === 6.55360000000000013642420526594 E2  [v1]
655.34 === 6.55340000000000031832314562052 E2  [v2]
           ^                ^
           1                17

  0.02 === 2.00000000000000004163336342344 E-2 [v3]

655.36
-      === 1.9999999999981810106 E-2 (as calculated in MSVC 2019) [v4]
655.34
           ^            ^   ^
           1            13  17

As requested, here's a C++ program generating and printing the numbers involved: https://gist.github.com/bilbothebaggins/d8a44d38b4b54bfefbb67feb5baad0f5

The Numbers as printed from C++ are:

'a'@14: 6.55360000000000e+02
' '@__: ................
'a'@18: 6.553600000000000136e+02
'a'@XX: 0x40847ae147ae147b

'b'@14: 6.55340000000000e+02
' '@__: ................
'b'@18: 6.553400000000000318e+02
'b'@XX: 0x40847ab851eb851f

'd'@14: 1.99999999999818e-02
' '@__: ................
'd'@18: 1.999999999998181011e-02
'd'@XX: 0x3f947ae147ae0000

'2'@14: 2.00000000000000e-02
' '@__: ................
'2'@18: 2.000000000000000042e-02
'2'@XX: 0x3f947ae147ae147b
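(The gist is a bit longer; its core is essentially the following kind of printing, shown here as a trimmed-down sketch with made-up names -- decimal output with a chosen number of digits after the decimal point plus the raw bit pattern:)

#include <cstdint>
#include <cstdio>
#include <cstring>

// Print a double with 'prec' digits after the decimal point (matching the
// "@NN" labels above) and as the raw bit pattern of its binary64 value.
void print_value(char tag, double d, int prec) {
  std::printf("'%c'@%02d: %.*e\n", tag, prec, prec, d);
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);  // type-pun via memcpy, not a cast
  std::printf("'%c'@XX: 0x%016llx\n", tag, static_cast<unsigned long long>(bits));
}

int main() {
  double const a = 655.36;
  double const b = 655.34;
  print_value('a', a, 18);
  print_value('b', b, 18);
  print_value('d', a - b, 18);
  print_value('2', 0.02, 18);
}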

The delta is later used in some multiplications, which naturally compounds the error even further.

Is there a way to do the delta calculation in a more sophisticated way so that we get closer to the "correct" answer?


My current thoughts are along the lines of:

If we were to calculate using an infinite precision type, given an input of double, we would first have to decide how to round the given doubles to our infinite precision type.

Say we round to 15 decimal digits (which would be plenty for our use case's inputs): we would get exact values, i.e. 655.36 and 655.34 exactly. If we then calculated the infinite-precision delta, we would get 0.02 exactly. And if we then converted that value back to double, we'd get the value v3 and not the "wrong" value v4.

So, would there be a way to normalize the starting values so that this (rounding) process could be replicated in pure IEEE754 calculations?
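(To make that concrete as far as pure double arithmetic goes -- and this is only a heuristic sketch building on the "round the delta to the digits that are still trustworthy" idea above, reusing the round_to_significant helper sketched earlier:)

#include <algorithm>
#include <cmath>

// Heuristic sketch: estimate how many leading decimal digits cancel in the
// subtraction and round the raw difference to the digits that are left,
// assuming the operands carry about 15 good decimal digits (DBL_DIG).
double get_delta_rounded(double lhs, double rhs) {
  double const raw = lhs - rhs;
  if (raw == 0.0)
    return raw;
  double const mag = std::max(std::fabs(lhs), std::fabs(rhs));
  int const cancelled = static_cast<int>(std::ceil(std::log10(mag / std::fabs(raw))));
  if (cancelled <= 0)
    return raw;                // no cancellation worth compensating for
  int const good = 15 - cancelled;
  if (good <= 0)
    return raw;                // nothing trustworthy left to round to
  return round_to_significant(raw, good);  // helper from the sketch above
}

// get_delta_rounded(655.36, 655.34) yields v3 (the double nearest to 0.02)
// instead of v4 -- at least for this particular example.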

Martin Ba
  • I'm not sure I understand but the inputs aren't `655.36` and `655.34` in `get_delta`. They are `655.36000000000001...` and `655.34000000000003...` so, it's already too late there. Neither `655.36` nor `655.34` can be exactly represented by an IEEE-754 `double`. – Ted Lyngmo Feb 09 '22 at 12:20
  • IEEE754 double has binary significand and so precisely 0.02 is impossible to represent. If you need fixed point value with up to 15 decimal digits then int64_t seems good enough as it has 18 decimal digits. – Öö Tiib Feb 09 '22 at 12:30
  • Please post the exact values of your input numbers and output numbers - `printf("%a %a %a", lhs, rhs, result);` from input and output numbers. `Specifically:` Are you posting some output from some javascript browser code? `if the naive delta calculation could be improved` yes. more precision! `long double` still more precision? https://en.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software `IFF we were to calculate using an infinite precision type` how do you know the input is not rounded that way so the result is `1.9`? – KamilCuk Feb 09 '22 at 13:02
  • Maybe this can shed some more light : https://learn.microsoft.com/en-us/cpp/build/reference/fp-specify-floating-point-behavior?view=msvc-170 – Robert Andrzejuk Feb 09 '22 at 13:15
  • Use `printf("%.*e\n", DBL_DIG - 1, get_delta(lhs, rhs));`. If this is not what you want, please provide more info, perhaps with examples including input, desired output and actual output. – chux - Reinstate Monica Feb 09 '22 at 14:23
  • `I posted the exact values.` So your input numbers are 0x40847AE147AE147B and 0x40847AB851EB851F ? – KamilCuk Feb 09 '22 at 14:31
  • `uint64_as_double(0x3f947ae147ae0000ULL)` *is* the *exact* difference between `uint64_as_double(0x40847ae147ae147bULL)` and `uint64_as_double(0x40847ab851eb851fULL)`. – njuffa Feb 09 '22 at 18:36
  • @njuffa yes it is the exact difference, but it is not the nearest value to the infinite-precision difference. It has *less* precision than the input values, so if you round it to less digits (as explained in the Q), you could get to a more correct result -- more correct in the sense that it approximates the infinite-precision calculation better. – Martin Ba Feb 09 '22 at 21:01
  • @MartinBa The **exact** difference of two IEEE-754 `binary64` operands is by definition the "the nearest value to the infinite-precision difference". The misunderstanding underlying the question appears to be the (incorrect) assumption that the decimal representation inspected exactly represents the `binary64` values involved. As such the question seems to be a duplicate of [this question](https://stackoverflow.com/questions/588004/is-floating-point-math-broken). – njuffa Feb 09 '22 at 21:05
  • Based on the `binary64` operands from the question, the computation here is: 6.553600000000000136424205265939235687255859375E+2 - 6.553400000000000318323145620524883270263671875E+2 = 1.999999999998181010596454143524169921875E-2 – njuffa Feb 09 '22 at 21:22
  • "how do you calculate the precision of the resulting difference?" --> If `|a| > |b|`, the precision of the difference is about `10e-15*fabs(a)` or `DBL_EPSILON(fabs(a))`, possibly better. – chux - Reinstate Monica Feb 10 '22 at 04:11
  • @njuffa - No, your link does not answer my question. Indeed given all the info I had added at the time of your comment I do find it insulting (as far as that's still possible here) that you would even suggest so. Granted, it seems I was not able to phrase my thoughts in a way for a wider audience to grok. Maybe I get around to a follow up that's more focused. Genuine thanks anyway. – Martin Ba Feb 10 '22 at 07:54
  • @Martin Ba The "Does this answer your question" verbiage that strikes you as insulting was *auto-generated* by Stackoverflow. This apparently happens whenever someone (me in this case) selects "duplicate" as the close reason. The basic point here is: On a system whose floating-point arithmetic is based on IEEE-754, computing `lhs - rhs` *does* return the difference that is closest to the infinitely precise result, and as I demonstrated, this happens here. – njuffa Feb 10 '22 at 08:06
  • @njuffa - I would add that I think we have a difference idea of "closest to the infinitely precise result" here. Elaborating probably too much for the comments. – Martin Ba Feb 10 '22 at 08:22

1 Answer


Get "precise" difference between two "near" IEEE754 double values?

This is generally not possible using finite floating point arithmetic, because the precise difference is not necessarily representable by the floating point type.

This can be achieved by converting the floating point numbers to an arbitrary precision representation, and calculating the result using that arbitrary precision. There are no arbitrary precision types in the standard library.
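For illustration, a sketch along those lines using Boost.Multiprecision's cpp_dec_float as the arbitrary-precision type (an assumption; any other high-precision decimal type would do). Note that, as described in the question, the operands also have to be snapped back to their intended 15-digit decimal values first; merely converting the doubles and subtracting would just reproduce the same exact binary difference:

#include <boost/multiprecision/cpp_dec_float.hpp>
#include <cstdio>

using dec50 = boost::multiprecision::cpp_dec_float_50;

// Treat each double as the 15-significant-digit decimal value it was
// presumably meant to represent, then subtract in decimal arithmetic.
double get_delta_decimal(double lhs, double rhs) {
  char buf[64];
  std::snprintf(buf, sizeof buf, "%.14e", lhs);  // 15 significant digits
  dec50 const a(buf);
  std::snprintf(buf, sizeof buf, "%.14e", rhs);
  dec50 const b(buf);
  dec50 const diff = a - b;                      // exact in decimal here
  return diff.convert_to<double>();
}

// get_delta_decimal(655.36, 655.34) yields the double nearest to 0.02.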

eerorika
  • OP already has these clarifications in their question. – Robert Harvey Feb 09 '22 at 14:01
  • @RobertHarvey I know. OP answered their own question. That doesn't change what the answer is. – eerorika Feb 09 '22 at 18:36
  • See also [Sterbenz' lemma](https://en.wikipedia.org/wiki/Sterbenz_lemma), which implies that for two "near" IEEE 754 binary64 values `x` and `y` ("near" in the sense that they have the same sign and are within a factor of two of each other in magnitude), the difference _is_ necessarily exactly representable in the binary64 format. So for two values that are near in that sense, converting the values to a higher precision before performing the subtraction wouldn't help. – Mark Dickinson Feb 10 '22 at 14:17
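A quick way to see the point made in the comments -- that for these two operands the subtraction itself is exact and the whole discrepancy comes from the operands -- is to recover the rounding error of the subtraction with the classic Fast2Sum trick (sketch; assumes the default strict floating-point model, e.g. /fp:precise, so the compiler doesn't reassociate):

#include <cstdio>

int main() {
  double const a = 655.36;
  double const b = 655.34;
  double const d = a - b;
  // Fast2Sum-style error term; valid because |a| >= |b|.
  double const err = (a - d) - b;
  // Prints 0: d is the exact difference of the two doubles nearest to
  // 655.36 and 655.34. The deviation from 0.02 stems entirely from the
  // operands themselves, not from the subtraction.
  std::printf("rounding error of (a - b): %g\n", err);
}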