
I have recently been wondering about multiplying floating point numbers.
Let's assume I have a number, for example 3.1415, with a guaranteed 3-digit precision.
Now, I multiply this value by 10 and get 31.415X, where X is a digit I cannot pin down because of the limited precision.

Now, can I be sure that the five gets carried over to the precise digits? If a number is only guaranteed precise up to 3 digits, I wouldn't expect this five to always pop up there, but after studying many cases in C++ I have noticed that it always does.

From my point of view, however, this doesn't make any sense, because floating point numbers are stored base-two, so multiplication by ten isn't really possible; it will always be multiplication by 10.something.

I ask this question because I wanted to create a function that calculates how precise a type is. I have come up with something like this:

    template <typename T>
    unsigned accuracy() {
        unsigned acc = 0;
        // Start from 1/3 = 0.333..., whose decimal expansion never terminates.
        T num = (T)1 / (T)3;
        // Peel off one decimal digit at a time and stop once it is no longer 3.
        while ((unsigned)(num *= 10) == 3) {
            acc++;
            num -= 3;
        }
        return acc;
    }

Now, this works for every type I've used it with, but I'm still not sure that the first imprecise digit will always be carried over unchanged.
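
For reference, here is how one might sanity-check this function against std::numeric_limits<T>::digits10, the decimal precision the standard library guarantees (a minimal sketch, assuming the accuracy template above is in scope):

    #include <cstdio>
    #include <limits>

    int main() {
        // Output of the hand-rolled probe above (exact values vary by platform).
        std::printf("accuracy<float>():  %u\n", accuracy<float>());
        std::printf("accuracy<double>(): %u\n", accuracy<double>());

        // Decimal digit counts the standard library guarantees for each type.
        std::printf("digits10 float:  %d\n", std::numeric_limits<float>::digits10);
        std::printf("digits10 double: %d\n", std::numeric_limits<double>::digits10);
        return 0;
    }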

  • This is not completely correct: _"...floating point numbers are stored base-two..."_. They are stored as a base-two fraction and a base-two exponent, e.g. 0.5 can be represented exactly but 0.1 cannot. – Richard Critten Jan 17 '20 at 16:32
  • Related/maybe dupe: https://stackoverflow.com/questions/588004/is-floating-point-math-broken – NathanOliver Jan 17 '20 at 16:42

3 Answers


I'll talk specifically about IEEE 754 doubles, since that's what I think you're asking about.

Doubles are defined as a sign bit, an 11-bit exponent and a 52-bit mantissa, which are concatenated to form a 64-bit value:

sign|exponent|mantissa

Exponent bits are stored in a biased format, which means we store the actual exponent plus 1023 (for a double). The all-zeros and all-ones exponent patterns are special, so we end up being able to represent scale factors from 2^-1022 to 2^+1023.

It's a common misconception that integer values can't be represented exactly by doubles, but we can actually store any integer in [0, 2^53] exactly by setting the mantissa and exponent properly; in fact, in the range [2^52, 2^53) the representable doubles are exactly the integers in that range. So 10 is easily stored exactly in a double.
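
As a quick illustration, here is a minimal sketch (assuming a C++20 compiler for std::bit_cast) that pulls the three fields out of a double and confirms that 10 is stored exactly as 1.25 * 2^3:

    #include <bit>
    #include <cstdint>
    #include <cstdio>

    int main() {
        const std::uint64_t bits = std::bit_cast<std::uint64_t>(10.0);

        const std::uint64_t sign     = bits >> 63;                  // 0
        const std::uint64_t exponent = (bits >> 52) & 0x7FF;        // biased: 3 + 1023 = 1026
        const std::uint64_t mantissa = bits & 0xFFFFFFFFFFFFFull;   // fraction bits of 1.25

        std::printf("sign=%llu exponent=%llu mantissa=0x%013llx\n",
                    (unsigned long long)sign,
                    (unsigned long long)exponent,
                    (unsigned long long)mantissa);
        // Expected output: sign=0 exponent=1026 mantissa=0x4000000000000
        return 0;
    }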

When it comes to multiplying doubles, we effectively have two numbers of this form:

A = (-1)^sA*mA*2^(eA-1023)
B = (-1)^sB*mB*2^(eB-1023)

where sA, mA, eA are the sign, mantissa, and exponent for A (and similarly for B).

If we multiply these:

A*B = (-1)^(sA+sB)*(mA*mB)*2^((eA-1023)+(eB-1023))

We can see that we merely sum the exponents and multiply the mantissas. This actually isn't bad for precision! We might overflow the exponent bits (and thus get an infinity), but other than that we just have to round the intermediate mantissa result back to 52 bits, which will at worst only change the least significant bit of the new mantissa.
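
To see this in action, here is a sketch that mirrors the formula using std::frexp and std::ldexp, which split off and reapply the power-of-two exponent (the printed digits are what a typical IEEE 754 system produces):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double a = 3.1415, b = 10.0;

        int ea, eb;
        const double ma = std::frexp(a, &ea);  // a == ma * 2^ea, with 0.5 <= ma < 1
        const double mb = std::frexp(b, &eb);  // b == mb * 2^eb

        // Multiply the significands and sum the exponents; the only rounding
        // happens in ma * mb, since ldexp's power-of-two scaling is exact.
        const double product = std::ldexp(ma * mb, ea + eb);

        std::printf("%.17g\n", product);                    // typically 31.415000000000003
        std::printf("same as a*b? %d\n", product == a * b); // 1 on a typical IEEE 754 system
        return 0;
    }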

Ultimately, the error you'll see will be proportional to the magnitude of the result. But doubles have an error proportional to their magnitude anyway, so this is really as safe as we can get. The way to approximate the error in your number is |magnitude| * 2^-53. In your case, since 10 is exact, the only error will come in the representation of pi. It will have an error of ~2^-51 and thus the result will as well.
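
To make that concrete, printing more digits than a double can honestly hold exposes both the representation error in 3.1415 and the result after multiplying. This is only a sketch; the exact trailing digits may differ slightly across platforms:

    #include <cstdio>

    int main() {
        const double x = 3.1415;    // already carries a tiny representation error
        const double y = x * 10.0;  // 10.0 is exact; one more rounding happens here

        std::printf("%.20f\n", x);  // something like  3.14150000000000018119
        std::printf("%.20f\n", y);  // something like 31.41500000000000269673
        return 0;
    }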

As a rule of thumb, I consider doubles to have ~15 digits of decimal precision when thinking about precision concerns.

gct
  • The preferred term for the fraction (versus exponent) part of a floating-point number is “significand.” “Mantissa” is an old word for the fraction part of a logarithm. Significands are linear (adding 10% to 1 increases the value represented by 10%). Mantissas are logarithmic (adding 10% to 1 multiplies the value represented by 10^.1). – Eric Postpischil Jan 17 '20 at 23:57
  • The significand of an IEEE binary64 is 53 bits, not 52. 52 are stored in the significand field, but 1 is encoded via the exponent field. – Eric Postpischil Jan 17 '20 at 23:58
  • The errors in binary64 operations are *up to* a number proportional to the magnitude of the result (½ ULP of the result), not simply *proportional* to the magnitude of the result. In other words, there is a known bound on the error that is proportional to the magnitude of the result, but that does not mean the error is proportional to it. The error could be zero or any value from zero up to the bound. – Eric Postpischil Jan 18 '20 at 00:01
  • “In your case, since 10 is exact, the only error will come in the representation of pi. It will have an error of ~2^-51 and thus the result will as well.” does not seem correct. The question author used 3.1415 as an example, which is not π, but let’s assume they meant π. Then there is one error in converting π to the floating-point format and a second error in doing a floating-point multiplication by 10. Those errors may compound or happen to cancel. The initial error has a bound of 2^-52 (ULP around 3 is 2^-51, bound is ½ ULP). Multiplying by 10 multiplies the error and adds another. – Eric Postpischil Jan 18 '20 at 00:05

Let's assume that in single precision 3.1415 is

0x40490E56

in IEEE 754 format, which is a very popular, but not the only, format used.

01000000010010010000111001010110

which splits into sign, exponent, and mantissa fields as

0 10000000 10010010000111001010110

so the binary portion is 1.10010010000111001010110

110010010000111001010110 = 1100 1001 0000 1110 0101 0110 = 0xC90E56

0xC90E56 * 10 = 0x7DA8F5C

Just like in grade school with decimal, you worry about the decimal (here binary) point later; you just do the multiply, then place the point:

01111.10110101000111101011100

To get into IEEE 754 format this needs to be shifted into 1.mantissa form, which here is a shift of 3:

1.11110110101000111101011

But look at the three bits chopped off, 100, and specifically the leading 1: depending on the rounding mode, this may round the result. In this case let's round up:

1.11110110101000111101100

0111 1011 0101 0001 1110 1100

0x7B51EC

Now, comparing with the answer as actually computed:

0x41FB51EC

0 10000011 11110110101000111101100

We moved the binary point by 3 and the exponent reflects that; the mantissa matches what we computed. We did lose one of the original non-zero bits off the right, but is that too much loss?

Double and extended precision work the same way, just with more exponent and mantissa bits, giving more precision and range. But at the end of the day the math is nothing more than what we learned in grade school; the format requires a 1.mantissa form, so you have to adjust the exponent to get the result into that form.
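
Here is a sketch to reproduce the bit patterns above, assuming strict IEEE 754 single-precision arithmetic with the default round-to-nearest mode (compilers that evaluate at higher intermediate precision may print something different):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Dump a float's raw bit pattern; memcpy avoids undefined type-punning.
    static std::uint32_t bits_of(float f) {
        std::uint32_t u;
        std::memcpy(&u, &f, sizeof u);
        return u;
    }

    int main() {
        const float a = 3.1415f;
        std::printf("0x%08X\n", (unsigned)bits_of(a));          // expect 0x40490E56
        std::printf("0x%08X\n", (unsigned)bits_of(a * 10.0f));  // expect 0x41FB51EC
        return 0;
    }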

old_timer

Now, can I be sure that the five gets carried over to the precise digits?

In general, no. You can only be sure about the precision of the output when you know the exact representation format used by your system, and know that the correct output is exactly representable in that format.

If you want a precise result for arbitrary rational inputs, then you cannot use finite precision.

It seems that your function attempts to calculate how accurately the floating-point type can represent 1/3. That figure is not useful for evaluating the accuracy of representing other numbers.

because floating point numbers are stored base-two

While very common, this is not universally true. Some systems use base 10, for example. You can query the base your implementation uses, as in the sketch below.
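
A minimal way to check which base your implementation actually uses is std::numeric_limits<T>::radix:

    #include <cstdio>
    #include <limits>

    int main() {
        // 2 on the usual IEEE 754 binary formats; 10 on decimal floating point.
        std::printf("float radix:  %d\n", std::numeric_limits<float>::radix);
        std::printf("double radix: %d\n", std::numeric_limits<double>::radix);
        return 0;
    }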

eerorika