
Is there a simple way to tell whether a particular number gets rounded up in its floating point representation? The reason I ask relates to a question I asked here, and a similar question that was asked here, amongst others.

To recap, I was trying to ask why, for example, the expression 0.5 % 0.1 doesn't result in approximately zero but instead gives (approximately) 0.1. Many respondents go on about how most numbers can't be represented exactly, and so on, but fail to actually explain why, for certain values, the result of the % operator is so far from zero when there is no remainder. It took me a long time to work out what was happening, and I think it's worth sharing. It also explains why I've asked this question.

It seems that the % operator doesn't result in zero when it should if the divisor is rounded up in its floating point format but the dividend isn't. The division algorithm iteratively subtracts the divisor from the dividend until that would produce a negative value. The quotient is the number of iterations, and the remainder is what's left of the dividend. It may not be immediately clear why this produces errors (it certainly wasn't to me), so I'll give an example.

For the 0.5 % 0.1 = (approximately) 0.1 case, 0.5 can be represented exactly, but 0.1 cannot and is rounded up. In binary, 0.5 is represented simply as 0.1, whereas decimal 0.1 in binary is 0.000110011001100... with the last four digits (1100) repeating. Because of the way the floating point format works, this gets truncated to 23 digits (in single precision) after the initial 1. (See the much-cited What Every Computer Scientist Should Know About Floating-Point Arithmetic for a full explanation.) It is then rounded up, as that is closer to the decimal value 0.1. So the values that the division algorithm works with are:

0.1 0000 0000 0000 0000 0000 000 --> 0.5 (decimal), and

0.0001 1001 1001 1001 1001 1001 101 --> 0.1 (decimal)
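A quick Java check of the two encodings above (illustrative only; the class name is arbitrary):

```java
public class ShowBits {
    public static void main(String[] args) {
        for (float x : new float[] { 0.5f, 0.1f }) {
            int bits = Float.floatToIntBits(x);
            int exp = ((bits >>> 23) & 0xFF) - 127;   // unbiased exponent
            String frac = String.format("%23s",
                    Integer.toBinaryString(bits & 0x7FFFFF)).replace(' ', '0');
            // For normal numbers: value = 1.<frac> * 2^exp
            System.out.println(x + " -> 1." + frac + " * 2^" + exp);
        }
        // Prints:
        // 0.5 -> 1.00000000000000000000000 * 2^-1
        // 0.1 -> 1.10011001100110011001101 * 2^-4   (rounded up)
    }
}
```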

The division algorithm iterations are:

(1) 0.100000000000000000000000000 - 0.000110011001100110011001101 = 0.011001100110011001100110011

(2) 0.011001100110011001100110011 - 0.000110011001100110011001101 = 0.010011001100110011001100110

(3) 0.010011001100110011001100110 - 0.000110011001100110011001101 = 0.001100110011001100110011001

(4) 0.001100110011001100110011001 - 0.000110011001100110011001101 = 0.000110011001100110011001100

(x) 0.000110011001100110011001100 - 0.000110011001100110011001101 = -0.000000000000000000000000001

As shown, after the 4th iteration a further subtraction would give a negative result, so the algorithm stops and the value of the dividend left over, 0.000110011001100110011001100, is the remainder: the approximation of decimal 0.1.
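In Java, for instance (an illustrative snippet; the same effect shows up in double precision, where 0.1 is also rounded up while 0.5 is exact):

```java
public class RemainderDemo {
    public static void main(String[] args) {
        // 0.5 is exact in binary; 0.1 is rounded up, so the truncated-quotient
        // remainder lands close to 0.1 instead of 0.
        System.out.println(0.5 % 0.1);                    // ~0.09999999999999998
        // Math.IEEEremainder rounds the quotient to nearest instead of truncating,
        // so it reports the tiny (here negative) difference instead.
        System.out.println(Math.IEEEremainder(0.5, 0.1)); // ~ -2.7755575615628914E-17
    }
}
```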

Further, the expression 0.6 % 0.1 works as expected because 0.6 also gets rounded up. The expression 0.7 % 0.1 doesn't work as expected: although 0.7 can't be represented exactly, it doesn't get rounded up. I've not tested this exhaustively, but I think this is what's going on. Which brings me (at last!) to my actual question:

Does anyone know of a simple way to tell if a particular number will be rounded up?

  • Maybe https://en.wikipedia.org/wiki/Unit_in_the_last_place (so `Math.ulp`) –  Oct 15 '16 at 17:18
  • The modulus calculation does not work iteratively. That would be insanely expensive. Also, it is somewhat brain-damaged, in that it returns funny results like those that you just experienced. So, instead of using the modulus operator, use `Math.IEEERemainder()` and be done with it. – Mike Nakis Oct 15 '16 at 17:48
  • Y'know, Math.IEEERemainder was the first thing I tried but somehow thought it was behaving the same. Well, ya live and learn. You sure about "modulus calculation does not work iteratively"? Every FPU division algorithm I've seen uses some kind of division by subtraction... Be glad to hear a different method though. – OffGridAndy Oct 15 '16 at 18:32
  • @MikeNakis Actually, floating-point remainder operations (e.g. `fmod` and `remainder` in C/C++) frequently do work iteratively based on subtraction. I know first hand, from implementing the `FPREM` and `FPREM1` microcode for an x86 processor (many hundreds of millions shipped), and implementing `fmod()` and `remainder()` for a (shipping) math library. Typically the iterations generate one bit of quotient per step. One can use FP division, but even then an iterative process is needed, and it is often less efficient since in real-life use the magnitudes of dividend and divisor are often close. – njuffa Oct 15 '16 at 18:43
  • @njuffa wow. First of all, respect. But then, one bit of quotient per iteration puts a log2(N) limit to the calculation, instead of a limit of N divided by a constant factor. But how can you be using subtraction and yet be generating one bit of quotient per iteration? – Mike Nakis Oct 15 '16 at 22:20
  • @MikeNakis Each step is scale-add or scale-subtract (note that the same applies to bit-wise integer division algorithms, where "scale" maps to a shift). [Here](https://github.com/pathscale/nvidia_sdk_samples/blob/master/vectorAdd/build/cuda/5.0.35-13978363_x64/include/math_functions_dbl_ptx3.h) are implementations of `fmod()`, `remainder()`, and `remquo()` I wrote for the CUDA standard math library, lines 2873-3199. I tried a variant based on FP division with round-to-zero (which allows one to generate up to 52 quotient bits at a time), but it was slightly slower than this in real applications. – njuffa Oct 15 '16 at 22:30
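Along the lines of the comments above, a rough sketch (not njuffa's actual FPREM/CUDA code, and ignoring overflow at the extremes) of a scale-and-subtract loop that produces one quotient bit per iteration; each subtraction is exact by Sterbenz's lemma, so for positive normal inputs it reproduces the `%` result:

```java
public class BitwiseRemainder {
    // Sketch only: assumes a > b > 0, both finite and normal.
    static double remainderBySubtraction(double a, double b) {
        double x = a;
        // Scale b up (exactly, by a power of two) into the same binade as a.
        double d = Math.scalb(b, Math.getExponent(a) - Math.getExponent(b));
        while (d >= b) {
            if (x >= d) {
                x -= d;            // exact: d/2 <= x < 2*d here (Sterbenz)
            }
            d = Math.scalb(d, -1); // next quotient bit
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(remainderBySubtraction(0.5, 0.1)); // ~0.09999999999999998
        System.out.println(0.5 % 0.1);                        // same value
    }
}
```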

1 Answer


Let's consider the case when floats a > b > 0. Each float is a multiple of its ulp, and we can write:

a = na*ulp(a), where ulp(a) = 2^ea; na is the integer significand of a and ea the exponent of ulp(a).
b = nb*ulp(b), where ulp(b) = 2^eb; nb is the integer significand of b and eb the exponent of ulp(b).
For a normalized float, 2^p > na >= 2^(p-1), where p is the float precision (p = 53 bits for IEEE 754 double precision).
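In Java, for instance (a sketch restricted to positive normal doubles; the class and method names are arbitrary), na and ea can be read off like this:

```java
public class Decompose {
    // Sketch for positive normal doubles only: x == n * 2^e, where
    // 2^52 <= n < 2^53 and Math.ulp(x) == 2^e.
    static int ulpExponent(double x) {
        return Math.getExponent(x) - 52;
    }
    static long integerSignificand(double x) {
        return (long) Math.scalb(x, -ulpExponent(x));   // exact power-of-two scaling
    }
    public static void main(String[] args) {
        for (double x : new double[] { 0.5, 0.1 }) {
            System.out.println(x + " = " + integerSignificand(x)
                               + " * 2^" + ulpExponent(x));
        }
        // 0.5 = 4503599627370496 * 2^-53
        // 0.1 = 7205759403792794 * 2^-56
    }
}
```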

Since a > b implies ea >= eb, the left-hand side below is an integer, so we can perform a (possibly large) integer division: na*2^(ea-eb) = nb*q + nr

From which we deduce na*2^(ea-eb)*2^eb = nb*2^eb*q + nr*2^eb, that is, a = b*q + nr*2^eb.
In other words, nr is the integer significand of the float remainder and eb its exponent, before normalisation.

From this, we see that the remainder operation is exact: since nr < nb, the remainder is representable as a float. So, strictly speaking, the remainder is never rounded up.
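To illustrate the exactness claim, here is the integer division above carried out for 0.5 and 0.1 (an illustrative snippet using `BigInteger` so the shifted significand cannot overflow) and compared against `%`:

```java
import java.math.BigInteger;

public class ExactRemainder {
    public static void main(String[] args) {
        double a = 0.5, b = 0.1;                       // positive normal doubles, a > b
        int ea = Math.getExponent(a) - 52;             // exponent of ulp(a)
        int eb = Math.getExponent(b) - 52;             // exponent of ulp(b)
        long na = (long) Math.scalb(a, -ea);           // integer significand of a
        long nb = (long) Math.scalb(b, -eb);           // integer significand of b
        // na*2^(ea-eb) = nb*q + nr, with ea >= eb because a > b
        BigInteger nr = BigInteger.valueOf(na).shiftLeft(ea - eb)
                                  .mod(BigInteger.valueOf(nb));
        double r = Math.scalb(nr.doubleValue(), eb);   // exact: nr < nb < 2^53
        System.out.println(r);                         // ~0.09999999999999998
        System.out.println(a % b);                     // the same value: % is exact
    }
}
```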

When the quotient is rounded to the nearest integer rather than truncated, which is the IEEE remainder operation,

a=b*q+r

then the remainder can be negative (r < 0), in which case you are interested in:

a=b*(q-1) + (b+r)

I presume that this case with a negative r forcing a b+r result is what you call rounded up. Unfortunately, there is no easy way to tell whether the remainder will be negative without performing the operation, except maybe when nb is a power of two (2^(p-1), or less in the case of gradual underflow).
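For example, with Java's `Math.IEEEremainder` (an illustrative snippet), 0.5 and 0.1 hit exactly this negative-r case, and b+r is the value the question observed from `%`:

```java
public class IeeeRemainderDemo {
    public static void main(String[] args) {
        double a = 0.5, b = 0.1;
        double r = Math.IEEEremainder(a, b);  // quotient rounded to nearest => r < 0 here
        System.out.println(r);                // ~ -2.7755575615628914E-17
        System.out.println(a % b);            // ~ 0.09999999999999998
        System.out.println(b + r);            // same value as a % b in this case
    }
}
```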

But you seem to be interested in the specific case a = i/10^j and b = 1/10^j, where you only have the float approximations float(i/10^j) and float(1/10^j). Assuming 10^j and i are exactly representable (j < 23 in double precision and i <= 2^53), we have access to the representation error with a fused multiply-add:

ea = fma(10^j, float(i/10^j), -i), so that 10^j*float(a) = 10^j*a + ea.
eb = fma(10^j, float(1/10^j), -1), so that 10^j*float(b) = 10^j*b + eb.

(Note that ea and eb here denote these representation errors, not the exponents used above.)
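For example, with i = 5 and j = 1 (an illustrative sketch; requires Java 9+ for `Math.fma`):

```java
public class FmaErrors {
    public static void main(String[] args) {
        double i = 5.0, p10 = 10.0;           // i and 10^j, both exactly representable
        double fa = i / p10;                  // float(i/10^j) = float(0.5)
        double fb = 1.0 / p10;                // float(1/10^j) = float(0.1)
        double ea = Math.fma(p10, fa, -i);    // 10^j*float(a) - i, one rounding only
        double eb = Math.fma(p10, fb, -1.0);  // 10^j*float(b) - 1, one rounding only
        System.out.println(ea);               // 0.0 : 0.5 is exact
        System.out.println(eb);               // ~5.551115123125783E-17 : 0.1 rounded up
    }
}
```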

You have i*b = a.
Now you want to see how this plays out with the float approximations, so you just take the remainder:

r = (a+ea/10^j)-i*(b+eb/10^j) = 1/10^j * ea - i/10^j * eb.

The float approximation could possibly work, but not always:

float(float(float(b)*ea) - float(float(a)*eb))

However, you'd be much better off using fma again:

r = fma(-i,eb,ea)/10^j
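Putting the formula together for the question's case i = 5, j = 1 (again an illustrative sketch): the sign of fma(-i, eb, ea) predicts whether float(i/10^j) % float(1/10^j) comes out near 0 or near b:

```java
public class SignTest {
    public static void main(String[] args) {
        double i = 5.0, p10 = 10.0;                 // i and 10^j
        double fa = i / p10, fb = 1.0 / p10;        // float(a), float(b)
        double ea = Math.fma(p10, fa, -i);          // representation errors, as above
        double eb = Math.fma(p10, fb, -1.0);
        double scaledR = Math.fma(-i, eb, ea);      // 10^j * (float(a) - i*float(b)), exact here
        System.out.println(scaledR);                // negative: float(a) < i*float(b),
                                                    // so % gives a result near b
        System.out.println(fa % fb);                // ~0.09999999999999998
    }
}
```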

The sign of the remainder will give you the side of the float approximation...
Here we simplified the problem a bit, because we didn't consider the case where the quotient could be off by more than 1. That should be OK because i < 2^53, but we did not prove it.
And it's just an exercise in style, because we are replacing a simple expression with more complex ones.

aka.nice