Floating Point Addition / Multiplication / Division

Question

I was doing some homework problems from my textbook and had a few questions on floating point rounding / precision for certain arithmetic operations.

If I have doubles cast from ints like so:

int x = random();
double dx = (double) x;

And let's say the variables y, z, dy, and dz follow the same format.

Then would operations like:

(dx + dy) + dz == dx + (dy + dz)
(dx * dy) * dz == dx * (dy * dz)

be associative? I know that with fractional values they would not be, because some precision is lost to rounding depending on the order in which the operands are added or multiplied. However, since these are cast from ints, I feel like precision would not be a problem and that these can be associative?

And lastly, the textbook I'm using does not explain FP division at all so I was wondering if this statement was true, or at least just how floating point division works in general:

dx / dx == dz / dz

I looked this up online and read in some places that an operation like 3/3 can yield .999...9, but there wasn't enough information to explain how that happens or whether it would vary with other division operations.

A good compiler should recognize dx/dx and not actually emit division instructions. — Russell Borogove, May 07 '15 at 02:27
You can exactly represent any integer of magnitude up to 2^53 as a double. Beyond that you run into rounding errors, even for integer values. http://stackoverflow.com/a/1848762/141172 — Eric J., May 07 '15 at 02:30
You might remember from your grade school days that a number divided by itself is 1, so the comparison 'may' work. However, in general, floating point numbers should never be compared using '=='. Instead, take the absolute value of the difference and check that it is less than some threshold — user3629249, May 07 '15 at 03:28
`(dx * dy) * dz == dx * (dy * dz)` is a problem if the precision of a `double` is less than twice the precision of an `int`, which is often the case. `(dx + dy) + dz == dx + (dy + dz)` is unlikely to be a problem, as `double` precision is certainly more than `int` precision + 1. `dx / dx == dz / dz` is an obvious problem should `dx==0` or `dz==0`. — chux - Reinstate Monica, May 07 '15 at 04:04
When `dx * dy` and `dy * dz` are greater than `2^53`, there may be a precision issue. With `double dx = (double)(INT_MAX); double dy = (double)(INT_MAX - 0x111111); double dz = (double)(INT_MAX - 0xabcd);`, `(dx * dy) * dz == dx * (dy * dz)` is false. — douyu, Dec 01 '21 at 07:24

score 1 · Answer 1 · answered May 07 '15 at 02:32

Assuming int is at most 32 bits wide and double follows IEEE-754, a double can store every integer of magnitude up to 2⁵³ exactly.
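As a minimal sketch of that 2⁵³ limit, assuming an IEEE-754 binary64 double (the helper name round_trips_through_double is made up for illustration and is not from the question), one can check whether a 64-bit integer survives a conversion to double and back:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reports whether n is exactly representable as a
 * double by converting it to double and back again.
 * Assumes IEEE-754 binary64 and |n| <= 2^62, so the conversion back to
 * int64_t cannot overflow. */
bool round_trips_through_double(int64_t n)
{
    double d = (double)n;      /* rounds if n needs more than 53 significant bits */
    return (int64_t)d == n;    /* equal only if nothing was rounded away */
}
```

Under those assumptions the check holds for `INT64_C(1) << 53` and fails for `(INT64_C(1) << 53) + 1`, since the latter has to be rounded to a neighbouring even integer.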

In the case of addition:

(dx + dy) + dz == dx + (dy + dz)

Every partial sum of three 32-bit ints has magnitude below 2³³, far under 2⁵³, so both sides of == hold their exact values and the addition is associative.
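A small sketch of the addition case, assuming int is at most 32 bits, double is IEEE-754 binary64, and expressions are evaluated in plain double precision. Both function names are invented for this example:

```c
#include <stdbool.h>

/* Compares the two groupings for arbitrary doubles. */
bool add_groupings_agree(double a, double b, double c)
{
    return (a + b) + c == a + (b + c);
}

/* The question's case: operands cast from int.
 * Assumption: int is at most 32 bits, so every partial sum is an integer
 * far below 2^53 and is computed without rounding. */
bool int_add_groupings_agree(int x, int y, int z)
{
    return add_groupings_agree((double)x, (double)y, (double)z);
}
```

On a typical IEEE-754 platform, `int_add_groupings_agree` holds even for three copies of the largest 32-bit int, whereas `add_groupings_agree(0.1, 0.2, 0.3)` does not, which is the fractional behaviour the question mentions.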

While in the case of multiplication:

(dx * dy) * dz == dx * (dy * dz)

The product of two 32-bit ints can already exceed 2⁵³ (it can reach about 2⁶²), so intermediate results may be rounded and the two sides are not guaranteed to be equal.
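To make the multiplication case concrete, here is a sketch under the same assumptions (32-bit int, IEEE-754 binary64 double, default round-to-nearest). The function name is hypothetical:

```c
#include <stdbool.h>

/* Compares the two groupings of a three-way product of ints cast to double.
 * Assumption: IEEE-754 binary64 with round-to-nearest. */
bool int_mul_groupings_agree(int x, int y, int z)
{
    double dx = (double)x, dy = (double)y, dz = (double)z;
    return (dx * dy) * dz == dx * (dy * dz);
}
```

One counterexample, worked out by hand, is x = 134217729 (2²⁷ + 1), y = 134217731 (2²⁷ + 3), z = 7. Here dx * dy is 2⁵⁴ + 2²⁹ + 3, which is rounded up to a multiple of 4 before being multiplied by 7 and rounded again, while dy * dz is exact, so the right-hand side is rounded only once. The two results end up 16 apart. Small operands such as 1000, 2000, 3000 still compare equal, because every product stays below 2⁵³.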

Yu Hao

Just clarifying, so the reason the max a double can store is 2^53 is because of the 52-bit mantissa + the implied leading 1? – cheng May 07 '15 at 03:51
To be clear: `double` can store all integers in `-pow(2,53) ... +pow(2,53)` exactly, inclusive, much like a 54-bit signed integer or `int54_t`. – chux - Reinstate Monica May 07 '15 at 04:15

score 1 · Answer 2 · answered May 07 '15 at 02:36

You should understand that floating point numbers are typically internally represented as a sign bit, a fixed point mantissa (of 52 bits with an implied leading one for IEEE 64-bit doubles), and a binary exponent (11 bits for IEEE doubles). You can think of the exponent as the "quantum" of math units for a given value.
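The layout described above can be inspected directly. This is only a sketch: it assumes double is the 64-bit IEEE-754 format, that double and uint64_t share a byte order, and that the value is a normal number (not zero, subnormal, infinity, or NaN). The struct and function names are invented here, and the bias of 1023 is the standard one for IEEE doubles.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of an IEEE-754 double. */
typedef struct {
    unsigned sign;      /* 1 bit: 0 for positive, 1 for negative */
    int      exponent;  /* 11-bit field with the bias of 1023 removed */
    uint64_t fraction;  /* 52 stored mantissa bits, leading 1 is implied */
} double_fields;

/* Splits a normal double into its sign, unbiased exponent, and fraction. */
double_fields decode_double(double d)
{
    uint64_t bits;
    double_fields f;

    memcpy(&bits, &d, sizeof bits);   /* well-defined way to read the bit pattern */
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (int)((bits >> 52) & 0x7FF) - 1023;
    f.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}
```

For example, 1.0 decodes to exponent 0 with an all-zero fraction, and 9007199254740992.0 (2⁵³) decodes to exponent 53, the point at which the least significant mantissa bit starts to represent 2 rather than 1.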

The addition is associative if the sums all fit into the mantissa without the quantum going above 2⁰ == 1, that is, while the least significant mantissa bit still represents 1. If random() is producing 32-bit integers, a sum such as (dx + dy) + dz will fit, and the addition will be associative.

In the case of multiplication, it's easy to see that the product of two 32-bit numbers may go well over 53 bits, so the quantum may need to go above 1 for the mantissa to contain the magnitude of the result. The low-order bits are then rounded away, so associativity can fail.

For division, in the particular case of dx / dx, the compiler may replace the expression with a constant 1.0 (perhaps after a zero check).
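A sketch of the division in question, assuming IEEE-754 semantics (division is correctly rounded and 0.0 / 0.0 yields NaN) and no value-changing optimizations such as fast-math. The function names are made up for illustration:

```c
#include <stdbool.h>

/* Returns x / x computed in double.
 * Assumption: IEEE-754 arithmetic, so the result is exactly 1.0 for any
 * nonzero x and NaN for x == 0. */
double self_quotient(int x)
{
    double dx = (double)x;
    return dx / dx;
}

/* The comparison from the question: dx / dx == dz / dz. */
bool self_quotients_equal(int x, int z)
{
    return self_quotient(x) == self_quotient(z);
}
```

Under those assumptions the comparison holds whenever both ints are nonzero, because the exact quotient 1 is representable and so no rounding occurs. It fails as soon as either is 0, since NaN compares unequal to everything, including itself.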

The exponent also has a built-in offset (I'm thinking 256) to allow for both positive and negative exponents without having to consume a bit for the sign — user3629249, May 07 '15 at 03:30
The exponent offset (usually called bias) is half the exponent range. For IEEE double with 11 bits of exponent, the bias is 1023. For IEEE single, it's 127. — Russell Borogove, May 07 '15 at 13:58
