One major thing to watch out for is that the C language originally specified that a computation like float a=b+c+d; would convert b, c, and d to the longest available floating-point type (which happened to be type double), add them together, and then convert the result to float. Such semantics were simple for the compiler and helpful for the programmer, but had a slight difficulty: the most efficient format for storing numbers isn't the same as the most efficient format for performing computations. On machines without floating-point hardware, it's faster to perform computations on a value stored as a not-necessarily-normalized 64-bit mantissa with a separately-stored 15-bit exponent and sign than to operate on values stored as a 64-bit double, which must be unpacked before every operation and then normalized and repacked afterward (even if only to be immediately unpacked again for the next operation). Having machines keep intermediate results in the longer format improved both speed and accuracy; ANSI C allowed for this with type long double.
Unfortunately, ANSI C failed to provide a means by which variable-argument functions could indicate whether they wanted all floating-point values converted to long double, all converted to double, or float and double passed as double and long double passed as long double. Had such a facility existed, it would have been easy to write code that didn't have to distinguish between double and long double values. Because no such feature exists, code does have to care about the distinction on systems where double and long double are different types, and doesn't on systems where they are the same. This in turn means that a lot of code written on systems where the types are the same will break on systems where they aren't; compiler vendors decided the easiest fix was simply to make long double synonymous with double and not provide any type that could hold intermediate computations accurately.
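To illustrate the kind of distinction code is forced to make, here is a minimal sketch using printf: in a variadic call, float is promoted to double, but long double is passed through unchanged, so the caller has to say which type it is passing via the length modifier in the format string.

#include <stdio.h>

int main(void)
{
    float f = 0.1f;
    double d = 0.1;
    long double ld = 0.1L;

    /* float is promoted to double in the variadic call, so plain %f covers both... */
    printf("%f\n", f);
    printf("%f\n", d);
    /* ...but long double needs the L length modifier, so the code must know which type it has. */
    printf("%Lf\n", ld);
    return 0;
}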
Since having intermediate computations performed in an unrepresentable type is bad, some people decided the logical thing was to have computations on float be performed as type float. While there are some hardware platforms where this may be faster than using type double, it often has undesirable consequences for accuracy. Consider:
#include <math.h>

float triangleArea(float a, float b, float c)
{
    long double s = (a+b+c)/2.0;
    return sqrt(s*(s-a)*(s-b)*(s-c));
}
On systems where intermediate computations are performed using long double, this will yield good accuracy. On systems where intermediate computations are performed as float, it may yield horrible accuracy even when a, b, and c are all precisely representable. For example, if a and b are 16777215.0f and c is 4.0f, the value of s should be 16777217.0, but if the sum of a, b, and c is computed as float, it will be 16777216.0; this will yield an area which is less than half the correct value. If a and c were 16777215.0f and b were 4.0f (same numbers, different order), then s would get computed as 16777218.0, yielding an area which is 50% too big.
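If you want to watch the rounding happen, here is a small sketch of the first case; the comments assume it runs on a platform such as x64 with SSE arithmetic, where float expressions really are evaluated in single precision.

#include <stdio.h>

int main(void)
{
    float a = 16777215.0f, b = 16777215.0f, c = 4.0f;

    /* Each addition rounds to a 24-bit significand, so 33554434 becomes 33554432. */
    float s_float = (a + b + c) / 2.0f;

    /* Widening the first operand keeps every intermediate exact for these values. */
    long double s_wide = ((long double)a + b + c) / 2.0L;

    printf("float intermediates:       %.1f\n", (double)s_float); /* 16777216.0 */
    printf("long double intermediates: %.1Lf\n", s_wide);         /* 16777217.0 */
    return 0;
}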
If you have calculations which yield good results on x86 (where many compilers eagerly promote intermediates to an 80-bit type, even though they unhelpfully make that type unavailable to the programmer) but lousy results on x64, I would guess you may have a calculation like the one above which needs its intermediate steps performed at higher precision than the operands or the final result. Changing the first line of the above method to:

long double s = ((long double)a+b+c)/2.0;

will force the intermediate computations to be done at higher precision, rather than performing the computations at low precision and then storing the inaccurate result into a higher-precision variable.
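Putting it together, a corrected sketch of the whole function might look like the following; it uses sqrtl (the long double square root from <math.h>) rather than sqrt, so even the final product isn't rounded down to double before the root is taken.

#include <math.h>

float triangleArea(float a, float b, float c)
{
    /* Carry every intermediate in long double; only the final result is rounded to float. */
    long double s = ((long double)a + b + c) / 2.0L;
    return (float)sqrtl(s * (s - a) * (s - b) * (s - c));
}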