
I was doing some homework problems from my textbook and had a few questions on floating point rounding / precision for certain arithmetic operations.

If I have doubles cast from ints like so:

int x = random();
double dx = (double) x; 

And let's say the variables y, z, dy, and dz follow the same format.

Then would operations like:

(dx + dy) + dz == dx + (dy + dz)
(dx * dy) * dz == dx * (dy * dz)

be associative? I know that with fractional values they would not be, because some precision is lost to rounding depending on the order in which the operands are added or multiplied. However, since these are cast from ints, I feel like precision would not be a problem and that these operations could be associative?

And lastly, the textbook I'm using does not explain FP division at all so I was wondering if this statement was true, or at least just how floating point division works in general:

dx / dx == dz / dz

I looked this up online and read in some places that an operation like 3/3 can yield .999...9, but there wasn't enough information to explain how that happens or whether it varies with other division operations.

cheng
  • A good compiler should recognize dx/dx and not actually emit division instructions. – Russell Borogove May 07 '15 at 02:27
  • You can exactly represent every integer up to 2^53 as a double; 2^53 + 1 is the first one that cannot be represented. Beyond that you run into rounding errors, even for integer types. http://stackoverflow.com/a/1848762/141172 – Eric J. May 07 '15 at 02:30
  • You might remember from your grade school days that a number divided by itself is 1, so the comparison 'may' work. However, in general, floating point numbers should never be compared using '=='; instead, take the absolute value of the difference and check that it is less than some threshold. – user3629249 May 07 '15 at 03:28
  • `(dx * dy) * dz == dx * (dy * dz)` is a problem if the precision of a `double` is less than twice the precision of an `int`, which is often the case. `(dx + dy) + dz == dx + (dy + dz)` is unlikely to be a problem, as `double` precision is certainly more than `int` precision + 1. `dx / dx == dz / dz` is an obvious problem should `dx==0` or `dz==0`. – chux - Reinstate Monica May 07 '15 at 04:04
  • When `dx * dy` and `dy * dz` are greater than `2^53`, there may be a precision issue. With `double dx = (double)(INT_MAX); double dy = (double)(INT_MAX - 0x111111); double dz = (double)(INT_MAX - 0xabcd);`, the comparison `(dx * dy) * dz == dx * (dy * dz)` is false. – douyu Dec 01 '21 at 07:24

2 Answers


Assuming int is at most 32 bits wide and double follows IEEE-754, a double can store every integer of magnitude up to 2^53 exactly.


In the case of addition:

(dx + dy) + dz == dx + (dy + dz)

Every intermediate sum has a magnitude below 2^33, far less than 2^53, so both sides of == are computed exactly and the addition is associative.


While in the case of multiplication:

(dx * dy) * dz == dx * (dy * dz)

It's possible for the products to exceed 2^53, where results are rounded, so the two sides are not guaranteed to be equal.

Yu Hao
  • Just clarifying, so the reason the max a double can store is 2^53 is because of the 52-bit mantissa + the implied leading 1? – cheng May 07 '15 at 03:51
  • To be clear: `double` can store all integers exactly `-pow(2,53) ... +pow(2,53)` --inclusive - just like a 54 bit signed integer or `int54_t`. – chux - Reinstate Monica May 07 '15 at 04:15

You should understand that floating point numbers are typically internally represented as a sign bit, a fixed point mantissa (of 52 bits with an implied leading one for IEEE 64-bit doubles), and a binary exponent (11 bits for IEEE doubles). You can think of the exponent as the "quantum" of math units for a given value.

The addition should be associative if the sums all fit into the mantissa without the exponent going above 2^0 == 1. If random() is producing 32-bit integers, a sum such as (dx + dy) + dz will fit, and the addition will be associative.

In the case of multiplication, it's easy to see that the product of two 32-bit numbers may go well over 53 bits, so the exponent may need to go above 1 for the mantissa to contain the magnitude of the result. The low-order bits are then rounded away, so associativity fails.

For division, in the particular case of dx / dx, the compiler may replace the expression with a constant 1.0 (perhaps after a zero check).

Russell Borogove
  • The exponent also has a built-in offset (I'm thinking 256) to allow for both positive and negative exponents without having to consume a bit for the sign – user3629249 May 07 '15 at 03:30
  • The exponent offset (usually called bias) is half the exponent range. For IEEE double with 11 bits of exponent, the bias is 1023. For IEEE single, it's 127. – Russell Borogove May 07 '15 at 13:58