
using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

I have tried different typecasts on scaledvalue2, but not until I stored the multiplication in a double variable and then converted that to an int could I get the desired result... and I can't explain why.

I know double precision (0.7 is actually stored as 0.6999999999999999555910790149937383830547332763671875) is an issue, but I don't understand why one way is OK and the other is not.

I would expect both to fail if precision is a problem.

I DON'T NEED a solution to fix it, just a WHY? (The problem IS fixed.)

#include <cstdio>
#include <sstream>

int main()
{
    double value = 0.7;
    int scaleFactor = 1000;

    double doubleScaled = (double)scaleFactor * value; 
    int scaledvalue1 = doubleScaled; // = 700

    int scaledvalue2 = (double)((double)(scaleFactor) * value);  // = 699 ??

    int scaledvalue3 = (double)(1000.0 * 0.7);  // = 700 

    std::ostringstream oss;
    oss << scaledvalue2;
    printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
      value,scaleFactor,doubleScaled,scaledvalue1,scaledvalue2,scaledvalue3,oss.str().c_str());

}

or in short:

double value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value);  // =  699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875);  // =  700
// scaledvalue_a = 699
// scaledvalue_b = 700

I can't figure out what is going wrong here.

Output :

convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699]

vendor_id : GenuineIntel

cpu family : 6

model : 54

model name : Intel(R) Atom(TM) CPU N2600 @ 1.60GHz

Ratman
    Off the top of my head, I don't know. I'm going to reopen this. Looks like some funky 80-bit internal computations going on. Change the title quickfast else it might be closed again. – Bathsheba Nov 03 '16 at 12:11
    Could you simplify this down to an `int main()`? I'm trying to guard against rapid closure of this question. – Bathsheba Nov 03 '16 at 12:19
  • seems to be platform-dependent: http://ideone.com/aKqeq0 returns convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 700[700] – midor Nov 03 '16 at 12:20
  • Hum. I think it's due to your platform using 80 bit floating point internally. Just need a boffin to confirm to me that the closest 80 bit binary float to 0.7 is larger. – Bathsheba Nov 03 '16 at 12:21
  • Something is rounding the 0.699999999...... but I would still expect the two ways to do the same thing. – Ratman Nov 03 '16 at 12:27
    [Unable to reproduce](https://ideone.com/EGUfEm). Also cannot reproduce with gcc 6.2.1. – Sam Varshavchik Nov 03 '16 at 12:31
  • Which CPU do you have? – wally Nov 03 '16 at 12:39
  • When I enable `-O3`, it is calculated only once – Danh Nov 03 '16 at 12:41
  • [Does the new code compile](http://coliru.stacked-crooked.com/a/b8ddeb1675168a1f)? – wally Nov 03 '16 at 12:42
  • .. just edited my initial code.. – Ratman Nov 03 '16 at 12:46
  • I can add this: int scaledvalue3 = (double)(1000.0 * 0.7); // = 700 – Ratman Nov 03 '16 at 13:12
    For 64 bit double, hex 3fe6666666666666 = .69999999999999996, and hex 3fe6666666666667 = .70000000000000007. For 80 bit long double, hex 3ffeb333333333333333 = .69999999999999999999, and 3ffeb333333333333334 = .70000000000000000004. So in both cases the nearest value to .7 is just below .7. A multiply by 1000 rounds to an exact value for 700, 64 bit is hex 4085e00000000000, 80 bit is hex 4008af00000000000000. I'm using Microsoft compilers (old 16 bit one to get 80 bit long doubles), on an Intel 3770k. – rcgldr Nov 03 '16 at 13:21
  • hehe: int scaledvalue4 = (double)(((double)scaleFactor) * value); //=699 Still 699... – Ratman Nov 03 '16 at 13:28
  • I'll try later... but to my initial question you all gave the answer I got... "this is a bit weird" :-) – Ratman Nov 03 '16 at 13:49
  • Typically, a double in a register is 80bit long, while a double in memory is only 64 bit long. Storing the value in a variable may end up moving the value to memory, causing an extra rounding step. When you reread the variable (back to an 80bit register), you don't get the same 80bit number as originally, so the cast gives a different value. This is mostly guesswork. – Marc Glisse Nov 03 '16 at 14:15
  • Managed to reproduce in g++ 4.4.5 (x86). Didn't find that compiler online, but see http://pastebin.com/tzjtsYrE. I suspect the rounding mode is different, but maybe someone can deduce more from the assembly output I provided there? – mindriot Nov 03 '16 at 17:44
  • @mindriot - What is the value for .LC0? The assembly code sequence for the two multiplies is the same. After the first multiply, the code stores and reloads doubleScaled, then stores scaledValue1 (int). The code does the same multiply again, then store the product into scaledValue2 (int). Are you seeing a difference between scaledValue1 and scaledValue2 when you run the example? The floating point control word (stored at 30(%esp)) is set to truncate / single precision (and stored at 28(%esp)) just before the integer stores, but it's the same for both integer stores. – rcgldr Nov 04 '16 at 10:15
  • @rcgldr Sorry, forgot to include that value in the paste. It's two `long`s: 1717986918, 1072064102. I thought that maybe in one of the cases it switches to round-to-nearest mode, but I'm not that savvy when it comes to x86 floating-point assembly… – mindriot Nov 04 '16 at 10:19
  • @mindriot - that's the correct value: hex 3fe6666666666666. I determined what the issue is, and will be posting an answer. – rcgldr Nov 04 '16 at 10:50

5 Answers


This is going to be a bit handwaving; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.

The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.

double f();
double g();
double d = f() + g(); // 1
double dd1 = 1.6 * d; // 2
double dd2 = 1.6 * (f() + g()); // 3

On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations are in fact done with 80 bits of precision (unless you set some switches that kill performance, as Java required), even though double and float are 64 bits and 32 bits, respectively. So for arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.

But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line //1, the compiler must convert the sum back to 64 bits to store it into the variable d. Then the value of dd1, calculated in line //2, must be computed using the value that was stored into d, i.e., a 64-bit value, while the value of dd2, calculated in line //3, can be calculated using f() + g(), i.e., a full 80-bit value. Those extra bits can make a difference, and the value of dd1 might be different from the value of dd2.

And often the compiler will hang on to the 80-bit value of f() + g() and use that instead of the value stored in d when it calculates the value of dd1. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>

For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.

Pete Becker
    "On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations are in fact done with 80 bits of precision (unless you set some switches that kill performance, as Java required)": wrong decade. Nowadays, SSE (64 bits) is faster than x87. – Marc Glisse Nov 04 '16 at 11:26

Since the x86 floating-point unit performs its computations in an extended-precision floating-point type (80 bits wide), the result can easily depend on whether intermediate values were forcibly converted to double (the 64-bit floating-point type). In that respect, in non-optimized code it is not unusual to see compilers treat memory writes to double variables literally, while ignoring "unnecessary" casts to double applied to temporary intermediate values.

In your example, the first part involves saving the intermediate result in a double variable

double doubleScaled = (double)scaleFactor * value; 
int scaledvalue1 = doubleScaled; // = 700

The compiler takes it literally and does indeed store the product in a double variable doubleScaled, which unavoidably requires converting the 80-bit product to double. Later that double value is read from memory again and then converted to int type.

The second part

int scaledvalue2 = (double)((double)(scaleFactor) * value);  // = 699 ??

involves conversions that the compiler might see as unnecessary (and they indeed are unnecessary from the point of view of abstract C++ machine). The compiler ignores them, which means that the final int value is generated directly from the 80-bit product.

The presence of that intermediate conversion to double in the first variant (and its absence in the second one) is what causes that difference.

AnT stands with Russia

I converted mindriot's example assembly code to Intel syntax to test with Visual Studio. I could only reproduce the error by setting the floating point control word to use extended precision.

The issue is that rounding to nearest is performed when converting from extended precision to double precision (when storing a double), whereas truncation is performed when converting from extended precision to integer (when storing an integer).

The extended precision multiply produces a product of 699.999..., but the product is rounded to 700.000... during the conversion from extended to double precision when the product is stored into doubleScaled.

double doubleScaled = (double)scaleFactor * value; 

Since doubleScaled == 700.000..., when truncated to integer, it's still 700:

int scaledvalue1 = doubleScaled; // = 700

The product 699.999... is truncated when it's converted into an integer:

int scaledvalue2 = (double)((double)(scaleFactor) * value);  // = 699 ??

My guess here is that the compiler generated a compile-time constant of 700.000... rather than doing the multiply at run time.

int scaledvalue3 = (double)(1000.0 * 0.7);  // = 700

This truncation issue can be avoided by using the round() function from the C standard library (declared in <cmath> in C++).

int scaledvalue2 = (int)round(scaleFactor * value);  // should == 700
rcgldr

Depending on compiler and optimization flags, scaledvalue_a, which involves a variable, may be evaluated at run time using your processor's floating-point instructions, whereas scaledvalue_b, which involves constants only, may be evaluated at compile time using a math library (e.g. gcc uses GMP, the GNU Multiple Precision Arithmetic Library, for this). The difference you are seeing appears to be the difference in precision and rounding between run-time and compile-time evaluation of that expression.

SpinyNormam

Due to rounding errors, most floating-point numbers end up being slightly imprecise. For the double-to-int conversion below, use std::ceil():

int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??

aks
  • using `ceil` solves for *this* particular example but it's not a general solution to problems of this kind. There are plenty of numbers where rounding down would be the correct behaviour. – Chris H Nov 03 '16 at 13:30
  • ceil would round up, not always wanted. round() (or roundl() if long doubles supported) should work. – rcgldr Nov 03 '16 at 13:41
  • For e.g. The representation of 700 in double can be like 699.99999998 or 700.00001. So you can use std::ceil( value - 0.5 ) or std::floor( value + 0.5 ). If any other rounding down is possible, please update. Thank you. – aks Nov 03 '16 at 13:49
  • round() (and roundl()) is supported only from VS2013 onward. https://msdn.microsoft.com/en-us/library/dn353646.aspx – aks Nov 03 '16 at 13:57
  • And rounding up or down is not the issue.. the problem is two lines of code that I would expect to do the same thing, yet they come up with different results.. – Ratman Nov 03 '16 at 14:03