What happens to the value of a floating point number when it's assigned to a long double?

Question

Edit: I've realised that I'm working with the type long double and not just double, which does make a difference. I've also added an example from my program below that reproduces the error in question.

Note: I'm currently working in C++11 and using GCC to compile.

I'm dealing with a situation where the result differs between the two calculations below:

value1 = x * 6.0;

long double six = 6.0;
value2 = x * six;

value1 != value2

Where all variables above are of type long double.

Essentially, I wrote a line of code that gives me an incorrect answer when I use 6.0 directly in the calculation, whereas if I first assign 6.0 to a variable of type long double and then use that variable in the calculation, I receive the correct result.

I understand the basics of floating point arithmetic, and I guess it's obvious that something is happening to the bits of 6.0 when it is assigned to the long double type.

Sample from my actual program (I left the calculation as is to ensure the error is reproducible):

#include <iomanip>
#include <iostream>
#include <math.h>

int main()
{
    long double six = 6.0;
    long double value1;
    long double value2;

    // value1 uses the long double variable six; value2 uses the double literal 6.0.
    value1 = (0.7854 * (pow(10, 5)) * six * (pow(0.033, 2)) * 1.01325 * (1.27 * 11.652375 / 1.01325 - 1.0));
    value2 = (0.7854 * (pow(10, 5)) * 6.0 * (pow(0.033, 2)) * 1.01325 * (1.27 * 11.652375 / 1.01325 - 1.0));

    // Print far more digits than either type holds so any difference is visible.
    std::cout << std::setprecision(25) << value1 << std::endl;
    std::cout << std::setprecision(25) << value2 << std::endl;
}

Where the output is:

7074.327896870849993415931
7074.327896870850054256152

Also, I understand that floating point calculations only hold precision up to a certain number of bits (so setting such a high precision shouldn't affect the results, e.g. after 15-17 digits it shouldn't really matter if the numbers vary, but unfortunately this does affect my calculation).

Question: Why are the above two code segments producing (slightly) different results?

Note: I'm not simply comparing the two numbers with == and receiving false. I've just been printing them out using setprecision and checking each digit.

even with only doubles or only floats you should not rely on `3.0 + 3.0 == 6.0` — 463035818_is_not_an_ai, Jun 22 '16 at 19:59
@RichardCritten I disagree, I've read that question many times and I fully understand the topics discussed there. — Paul Warnick, Jun 22 '16 at 20:00
@PaulWarnick As per your note you should show how you are printing and what you receive. — Fantastic Mr Fox, Jun 22 '16 at 20:02
What's the data type of value1 and x? The result will depend on their data types too. — Zaki Mustafa, Jun 22 '16 at 20:02
I know I've seen similar questions to this, but I can't find them. The difference has to do with the precision that it uses for in-line addition versus losing precision when it saves the value in a variable. — Barmar, Jun 22 '16 at 20:04
@ZakiMustafa I've edited the question to include that they are type double — Paul Warnick, Jun 22 '16 at 20:04
@Barmar Yeah I tried to look for a solution for this but I couldn't find one on SE (surprisingly due to the simplicity of the question). Also, that was along the lines of my thinking. — Paul Warnick, Jun 22 '16 at 20:06
@PaulWarnick, I don't see any reason the two should be different. Perhaps the compiler is optimizing them differently or something weird. — chris, Jun 22 '16 at 20:10
You are basically doing something that is ill-defined for float/doubles. You cannot expect == or != to give a perfect result. I understand what you are asking here, but in truth, once you start worrying about the froth fraction digits in floating point types, you are already doing it wrong. As to why using a double 6.0 gives a different result than letting the compiler do some work, I'd guess this has to do with a standard somewhere or another. I'll let one of the guys who has those links handy answer that assertion. — Michael Dorgan, Jun 22 '16 at 20:10
I'm not exactly sure how this is considered the same question as http://stackoverflow.com/questions/588004/is-floating-point-math-broken as it doesn't answer what happens to the value of a double when used inline vs assigning it to a variable first (which is my question). — Paul Warnick, Jun 22 '16 at 20:11
Then re-define your question to ask why the compiler used statement is specifically different than placing it in a double class first and add info about the compiler being used, soft/hard fp support, etc. Right now, it reads very very close to the given question. — Michael Dorgan, Jun 22 '16 at 20:12
@chris My thinking as well, I was just trying to see if this is a common issue that has a standard solution / explanation. — Paul Warnick, Jun 22 '16 at 20:13
As far as I can see the first answer to the duplicate question precisely answers this question. — Galik, Jun 22 '16 at 20:13
@Galik I've never used the chat for SE before but is this something I would be able to discuss with a few people using it? — Paul Warnick, Jun 22 '16 at 20:16
@PaulWarnick I have retracted my duplicate vote because I figured out what you are actually asking. However I am unable to reproduce, can you provide a working example? — Galik, Jun 22 '16 at 20:22
@Galik Thank you! Yes I'll edit to provide a working example (although it might not be pretty). — Paul Warnick, Jun 22 '16 at 20:23
@PaulWarnick In addition to a complete, buildable code for repro, please also state the exact command you used when compiling this code with `gcc`. — njuffa, Jun 22 '16 at 20:43
@njuffa I will make the changes so that it's fully reproducible (right now it's just a sample so you can see what exactly is happening) + I'll add in the gcc commands. — Paul Warnick, Jun 22 '16 at 20:47
@PaulWarnick Your question says the variables are of type `double`, but the code you edited in just now shows they are of type `long double`. I would suggest cleaning up this inconsistency. — njuffa, Jun 22 '16 at 20:48
@PaulWarnick can you try making the literal a long double (i.e. 6.0L)? — Anon Mail, Jun 22 '16 at 20:52
@AnonMail You're exactly right, the fact that I forgot I was using long doubles has completely turned me around. By changing it to 6.0L the numbers become the same. Thank you. — Paul Warnick, Jun 22 '16 at 20:55
@Paul Warnick Compiling with godbolt shows that the first expression is computed at runtime using x87 FPU, the second expression is computed at compile time. Try `const long double six = 6.0` and see whether that makes the answer the same (by allowing compile-time computation for the first variant as well). — njuffa, Jun 22 '16 at 21:02
Given that now we have a `long double` being assigned a `double` is this not just a *duplicate*? — Galik, Jun 22 '16 at 21:09
@Galik A duplicate of http://stackoverflow.com/questions/588004/is-floating-point-math-broken ? I don't think so, surprisingly the issue was simply that I was calculating with both double and long double (when I thought I was using long double for everything). — Paul Warnick, Jun 22 '16 at 21:16
I suppose now the question title itself doesn't really make sense but I got an answer to my problem. (I'm not sure what to change the title to so that it implies the question "Why are my calculations different", instead of what's happening with long doubles (which is not the problem)). — Paul Warnick, Jun 22 '16 at 21:18
I believe it is a duplicate because the *real* issue is that you are loading a *long double* with a value that is a *decimal representation* of a number that can not be accurately represented in a long double. That issue does not affect the results if the *numeric type* is identical in *both* examples because the *errors* are the same. But when you have a *double* and a *long double* the errors are now different. — Galik, Jun 22 '16 at 21:28
`6.0` is exactly representable in a `float`, in a `double`, or in a `long double`. As I alluded to above, the main difference between the two original variants seems to be that one is computed at runtime, the other at compile time, and there may well be discrepancies between the two modes of evaluation (without consulting the source code of the compiler, compile-time computation is a black box). At least that is what a look at the disassembled object code would suggest. — njuffa, Jun 22 '16 at 21:35
@njuffa Looking at the new example I doubt either of those calculations is happening at compile time (calling `pow()`). I suspect it is because one calculation is promoted to `long double` by the inclusion of the variable `six` whereas the other is calculated to only `double` precision because it contains only doubles. — Galik, Jun 22 '16 at 21:40
@PaulWarnick I actually think you have presented us with 3 separate problems at different times, one of which turned out to be a non-problem and another turned out to have been over-simplified so as to not actually be the right problem. — Galik, Jun 22 '16 at 21:45
So I think the *actual* problem is one of *promotion* as I described in my previous but one post. That's not a duplicate of the other post but it probably does have duplicates somewhere. — Galik, Jun 22 '16 at 21:49
@Galik Both scenarios could apply. Without knowing the gcc version, and exact compiler switches, and full, buildable code, this can't be diagnosed conclusively. I will vote to close. — njuffa, Jun 22 '16 at 21:54
I was unable to locate a duplicate (I am sure there must be some) so I provided an answer. I think the main thing is to always provide a **working example** that reproduces the error. Then you can't go wrong:) — Galik, Jun 22 '16 at 21:59

Galik · Accepted Answer · 2016-06-22T22:09:33.123

The problem here I believe is one of promotion.

long double six = 6.0;
long double value1;
long double value2;

value1 = (0.7854 * (pow(10, 5)) * six * (pow(0.033, 2)) * 1.01325 * (1.27 * 11.652375 / 1.01325 - 1.0));
value2 = (0.7854 * (pow(10, 5)) * 6.0 * (pow(0.033, 2)) * 1.01325 * (1.27 * 11.652375 / 1.01325 - 1.0));

Looking at the second calculation we notice that every term in the expression is type double. This means the whole expression will be evaluated to double precision.

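However, the first calculation contains the variable six, which is of type long double. Multiplication groups left to right, so once six enters the chain every later multiplication has a long double operand and is carried out at long double precision. Only the subexpressions that contain nothing but double operands, such as the pow calls and the parenthesised final factor, are still evaluated in double before being converted.

So this difference in precision is the likely cause of the discrepancy: most of the first calculation is carried out at long double precision, while the second is carried out entirely at double precision and only converted to long double when the result is assigned.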

In fact a simple change to the code can prove this. If we change the type of the term 6.0 from double to long double by writing 6.0L we will get identical results because both expressions are now calculated to the same precision:

value1 = (0.7854 * (pow(10, 5)) * six * (pow(0.033, 2)) * 1.01325 * (1.27 * 11.652375 / 1.01325 - 1.0));
value2 = (0.7854 * (pow(10, 5)) * 6.0L * (pow(0.033, 2)) * 1.01325 * (1.27 * 11.652375 / 1.01325 - 1.0));
