
Let's suppose that a is a normalized floating-point number in base 2 (binary). Is the following equality correct?

fl(a+a+a)=fl(3*a)

SlimJim
  • Consider the last 2 bits of the mantissa, and write down what the result of the multiplication and additions will be for each possible value. – EOF Feb 09 '20 at 21:35
  • Somehow related to https://stackoverflow.com/questions/21690585/is-3xx-always-exact – aka.nice Feb 10 '20 at 16:26

2 Answers

2

Notation and Prerequisites

*a* denotes a mathematical number. `a` denotes a floating-point value. 3a denotes a mathematical expression (using real-number arithmetic). `a+a+a` and `3*a` denote expressions using floating-point arithmetic.

A fundamental characteristic of arithmetic in typical floating-point systems is that the result of a floating-point operation is defined to be the mathematical result rounded to a representable value in some chosen direction (most often to the nearest representable value, with ties going to the representable value with the even low digit, but other directions may be elected).1

Finite, Normal Values

In binary floating-point, if *a* is representable and 2a is within the finite range, then 2a is representable, since the only difference in their representations is in the exponent. Therefore, given a floating-point number `a` representing the number *a*, the result of `a+a` is exactly 2a. Then the floating-point result of `a+a+a` (which is `(a+a)+a`) is the mathematical result 3a (since the mathematical result is 2a+a = 3a) rounded to the nearest representable value. And the floating-point result of `3*a` is also the mathematical result 3a rounded to the nearest representable value. Therefore `a+a+a` and `3*a` have the same floating-point result, and the equality holds.
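
A quick empirical check of this in C (a sketch only: it samples random values rather than proving anything, and it assumes IEEE-754 `double` with round-to-nearest):

    /* Check a+a+a == 3*a over random finite doubles. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        srand(12345);
        for (int i = 0; i < 1000000; i++) {
            double a = ((double)rand() / RAND_MAX - 0.5) * 1e10;
            if (a + a + a != 3 * a) {
                printf("counterexample: a = %.17g\n", a);
                return 1;
            }
        }
        printf("no counterexample found\n");
        return 0;
    }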

Special Cases

It remains to consider special cases.

If `a`, representing *a*, is finite, but 2a exceeds the range for which the floating-point result is finite, then `a+a` produces an infinity, and `a+a+a` produces the same infinity, and so does `3*a`, so the equality holds.

If `a` is an infinity, then `a+a`, `a+a+a`, and `3*a` produce the same infinity, and the equality holds.

If `a` is a NaN, then `a+a+a` and `3*a` are both NaNs, and they do not compare equal, because a NaN never compares equal to anything, not even another NaN with the same value.
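
These special cases can be observed directly; a sketch in C, assuming IEEE-754 `double` semantics:

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double big  = DBL_MAX;   /* big + big overflows to +infinity */
        double inf  = INFINITY;
        double qnan = NAN;

        printf("overflow: %d\n", big + big + big == 3 * big);     /* 1: both +inf  */
        printf("infinity: %d\n", inf + inf + inf == 3 * inf);     /* 1: both +inf  */
        printf("NaN:      %d\n", qnan + qnan + qnan == 3 * qnan); /* 0: NaN != NaN */
        return 0;
    }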

Footnote

1 The question does not specify the floating-point system used. Certainly one can define a floating-point system in which 1+1 produces 0 and 0+1 produces 0 while 3*1 produces 5. However, for the purposes of this answer, we assume a typical floating-point system.

Eric Postpischil
  • In addition, I think the following holds for IEEE-754 arithmetic: If we assume '=' in the question denotes a test for bit-wise identity (as opposed to floating-point equality), then `a+a+a` and `3*a` deliver identical results if `a` is a NaN, whether an implementation supports NaN payloads or not. – njuffa Feb 10 '20 at 01:33
  • 1
    @njuffa: And for signed zeros, I think. – Eric Postpischil Feb 10 '20 at 01:36
  • "fundamental characteristic of arithmetic in typical floating-point systems is that the result of a floating-point operation is defined to be the mathematical result rounded to the nearest representable value in some direction" not true. IEEE rounds, DSPs like TI do not. for a good reason, dont generalize to IEEE. Note the math here for a+a+a vs 3*a does not rely on rounding. show WHY these statements work rather than just make statements that have to be then verified elsewhere – old_timer Feb 10 '20 at 02:07
  • 1
    @old_timer: (a) What do “DSPs like TI” do? (TI says some of them use IEEE-754.) (b) Customizing the answer to IEEE-754 would be a specialization, not a generalization. (c) What do you mean the math for `a+a+a` versus `3*a` does not rely on rounding? Yes, it does. (d) I do show why these statements work; the answer is written as a mathematical proof. It states axioms and givens and uses logical deductions from them, along with basic arithmetic properties. – Eric Postpischil Feb 10 '20 at 02:19
  • Thanks! So in the same manner, if a, b, c are normalized and don't exceed the range, then fl((a+b)+c) should give the same mathematical result as fl(a+(b+c)), right? Sorry for the multiple questions, I'm new to floating-point arithmetic – SlimJim Feb 10 '20 at 12:36
  • 1
    @SlimJim: No. fl(fl(a+a)+a) equals fl(a+a+a) because fl(a+a) is **exact**, and it is exact because a+a = 2a, and 2 is the floating-point base. This also means that fl(fl(a+a)+b) = fl(a+a+b). However, in general, fl(a+b) is not exact, and fl(fl(a+b)+c) is, in general, not equal to fl(a+b+c). (Also, if the floating-point base is not 2, then fl(a+a) is not in general exact. For example, with base 16 and one digit of precision, 9•16^0+9•16^0 produces the one-hexadecimal-digit result 1•16^1 = 16, but the mathematical result is 9+9 = 18 (hexadecimal 12).) – Eric Postpischil Feb 10 '20 at 12:59
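
A short C illustration of the non-associativity described in that last comment (IEEE-754 doubles assumed; the values 0.1, 0.2, 0.3 are just a convenient example):

    #include <stdio.h>

    int main(void) {
        double a = 0.1, b = 0.2, c = 0.3;
        printf("%d\n", (a + b) + c == a + (b + c)); /* 0: not equal */
        printf("%.17g\n", (a + b) + c);             /* 0.60000000000000009 */
        printf("%.17g\n", a + (b + c));             /* 0.59999999999999998 */
        return 0;
    }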
0

You didn't specify a format; IEEE 754 is popular, but not the only beast out there.

As a general rule, don't use equality with floating point, because floating-point binary formats are not the same as infinitely accurate numbers. Here you have two operations on one side and one operation on the other; each operation can clip, so loss of precision per operation should be expected, and two operations can lose/change more than a single operation. Rounding, if part of the format and implementation, can help undo some of this loss in specific cases, but it is not a complete fix and sometimes hurts.
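
For example, an identity that holds on paper but fails with IEEE-754 doubles, because the operands and the operation are each rounded (a sketch):

    #include <stdio.h>

    int main(void) {
        printf("%d\n", 0.1 + 0.2 == 0.3); /* 0: false with IEEE doubles */
        printf("%.17g\n", 0.1 + 0.2);     /* 0.30000000000000004 */
        printf("%.17g\n", 0.3);           /* 0.29999999999999999 */
        return 0;
    }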

So, as a rule, don't assume this will work. But in this specific case, let's start with an IEEE-style 1.fraction format; we don't need very many bits of mantissa to see how this works.

1.abc (a,b,c are single bits)

As Eric points out start with 2*a

  1abc
+ 1abc
========
 1xxx0

No matter what c is, a 1 or a 0, the lsbit is a 0. In fixed point, 1abc+1abc = 2*1abc, so long as we don't overflow the exponent:

  1abc
+ 1abc
========
 1abc0

You can walk through the combinations and numbers of mantissa bits if you like; a sketch of that walk follows below.
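
Here is one way to do that walk in C, modelling the 4-bit mantissa 1abc as an integer (an illustration of the claim, with the exponent handled implicitly by the shift):

    /* Walk all 4-bit mantissas 1abc (binary 1000..1111). m + m is
       exactly m shifted left one; renormalizing shifts right one and
       bumps the exponent, and the bit dropped is always 0, so
       nothing is lost. */
    #include <stdio.h>

    int main(void) {
        for (unsigned m = 0x8; m <= 0xF; m++) {
            unsigned sum = m + m;  /* the 5-bit result 1abc0 */
            printf("m=%2u  m+m=%2u  renormalized=%2u  bit dropped=%u\n",
                   m, sum, sum >> 1, sum & 1);
        }
        return 0;
    }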

adding a+a+a comes to this point

 1abc0
+ 1abc
=======

for the multiply case

    1abc
    1100
  ====== 
    0000
   0000
  1abc
+1abc
===========

which comes to this point

   1abc
+ 1abc0
========

so they end in the same place; they are equal, yes?

Rounding is not a factor here as the lower bits result in the same values through either path...yes?

Negative numbers in IEEE, for example, also use the positive 1.f format; the sign is handled separately. In this case, down the multiply path, p * p is positive and p * n is negative, so both paths work and you retain the sign (3 is always positive here). For the addition path the results either get more negative or more positive; they retain their sign.

The first addition can result in an overflow of the exponent, which in IEEE produces an infinity (not a NaN); the second addition and the multiply then overflow to the same infinity, so the comparison still holds in that case. Where the comparison does fail is when a itself is a NaN: both paths produce a NaN, and two NaNs never compare equal, whatever their bit patterns.
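
A small C check of the NaN case (assuming IEEE-754: `==` is required to report false for NaNs, while a bit-for-bit match of the two results is implementation-dependent, so don't rely on either outcome of the memcmp):

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        double a = NAN;
        double x = a + a + a;
        double y = 3 * a;
        printf("x == y:        %d\n", x == y);                        /* 0       */
        printf("bitwise match: %d\n", memcmp(&x, &y, sizeof x) == 0); /* often 1 */
        return 0;
    }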

Other formats: still working on those, but it should be as easy to demonstrate as the 1.f type format(s). Positive 01.fraction, negative 10.fraction, for example.

The key here is that the implementation of the multiply should support the 2n bits needed (for fixed point, n bits * n bits = 2n bits covers all cases: a 32-bit number times a 32-bit number needs 64 bits to store the result; a small sketch of this follows below). Formats tending to force a one at the top of the value mean you are multiplying 1xxxxx... * 1xxxxx.... From grade school:

a.bcd
e.fgh
======

We would do the math as fixed point, abcd * efgh, then automatically move the binary point six places over in the result. Binary floating point will then need to normalize, moving the most significant 1 (or 0, depending on the format) to just left of the binary point and adjusting the exponent, similar to how we would use powers of 10 in scientific notation to further adjust the result. Ideally the logic uses enough bits to make the fixed-point math work out right, then normalizes that, then clips and maybe rounds.
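
The 2n-bit width requirement is easy to see in plain fixed point; a C sketch of the width argument (not of an actual FPU mantissa multiplier):

    /* n bits times n bits needs 2n bits: widen before multiplying so
       all 64 bits of a 32x32 product survive. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t a = 0xFFFFFFFFu, b = 0xFFFFFFFFu;  /* worst case */
        uint64_t full = (uint64_t)a * b;            /* full 64-bit product */
        printf("0x%016" PRIX64 "\n", full);         /* 0xFFFFFFFE00000001 */
        return 0;
    }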

So at the end of the day you will see something like

     abcd
    abcd
   abcd
  abcd
+ ...

And for the 3*a vs a+a+a case

best case keep the c as a sticky bit

 1abc0
+ 1abc
=======
 xxxxx

worst case clip it

 1abc0
+ 1abc
=======
 xxxx  

for multiply

best case

    0000
   0000
  1abc
+1abc
===========
 xxxxxxx

worst case clipping

    0 
   00 
  1ab 
+1abc
===========
 xxxx

In both cases we keep/lose a single bit, which down both paths produces the same result, so long as the add and the multiply make the same optimizations. This is only shown for 3*a vs a+a+a; you would have to walk through the possibilities for any other equation.

Bottom line: assume that equations that are exactly equal using infinite-precision math are not equal using binary floating-point math. There are cases like this one that can be shown to be equal (depending on implementation), but I would call those the exceptions, not the rule.

Assume they won't be equal and be happy if they are; even better if you can prove it (for your specific target hardware or for the format in general), then remember and document the limitations of that proof. If you forget the proof or don't document the rules, someone will come along later and mess up, the math can/will be wrong, and the software won't work. So be very, very careful making math optimizations without thinking them through, in particular if you are expecting an equals comparison to work.

I worked on control systems where we had to modify the lsbit of a mantissa to make the math work; computing the constants the straightforward way didn't produce the right set of values. Any further adjustment to that control system's parameters would require a check of the values and possibly tweaking one or more of them for the math to work.

old_timer