Is it possible to clear the FPU?

Question

I'm using Delphi XE6 to perform a complicated floating point calculation. I realize the limitations of floating point numbers so understand the inaccuracies inherent in FP numbers. However this particular case, I always get 1 of 2 different values at the end of the calculation.
The first value and after a while (I haven't figured out why and when), it flips to the second value, and then I can't get the first value again unless I restart my application. I can't really be more specific as the calculation is very complicated. I could almost understand if the value was somewhat random, but just 2 different states is a little confusing. This only happens in the 32-bit compiler, the 64 bit compiler gives one single answer no matter how many times I try it. This number is different from the 2 from the 32-bit calculation, but I understand why that is happening and I'm fine with it. I need consistency, not total accuracy.
My one suspicion is that perhaps the FPU is being left in a state after some calculation that affects subsequent calculations, hence my question about clearing all registers and FPU stack to level out the playing field. I'd call this CLEARFPU before I start of the calculation.

After some more investigation I realized I was looking in the wrong place. What you see is not what you get with floating point numbers. I was looking at the string representation of the numbers and thinking here are 4 numbers going into a calculation ALL EQUAL and the result is different. Turns out the numbers only seemed to be the same. I started logging the hex equivalent of the numbers, worked my way back and found an external dll used for matrix multiplication the cause of the error. I replaced the matrix multiplication with a routine written in Delphi and all is well.

You can try Set8087CW for 32-bit code (64-bit code should use SSE, not FPU) — MBo, Nov 04 '15 at 17:18
Such guesswork is not constructive. The asker needs to face up to the need to understand the problem. — David Heffernan, Nov 04 '15 at 17:46
Try to insert `Get8087CW` calls before and after the calc result change and see if there is a difference in the returned values. — kludg, Nov 04 '15 at 17:49
Regarding the 64bit/32bit differences. With Delphi, you can have different values for uninitialized local variables depending on whether it's the 64 bit or 32 bit compiler. We recently had an example where the 32 bit compiler was setting an uninitialized local boolean variable to false and the 64 bit compiler just happened to use true (or vice versa). You problem sounds very much like an uninitialized local variable. In our case there was no compiler warning because it was passed into a method. — Graymatter, Nov 05 '15 at 06:02

David Heffernan · Accepted Answer · 2015-11-04T17:51:59.790

Floating point calculations are deterministic. The inputs are the input data and the floating point control word. With the same input, the same calculation will yield repeatable output.

If you have unpredictable results, then there will be a reason for it. Either the input data or the floating point control word is varying. You have to diagnose what that reason for that is. Until you understand the problem fully, you should not be looking for a problem. Do not attempt to apply a sticking plaster without understanding the disease.

So the next step is to isolate and reproduce the problem in a simple piece of code. Once you can reproduce the issue you can solve the problem.

Possible explanations include using uninitialized data, or external code modifying the floating point control word. But there could be other reasons.

Uninitialized data is plausible. Perhaps more likely is that some external code is modifying the floating point control word. Instrument your code to log the floating point control word at various stages of execution, to see if it ever changes unexpectedly.

The floating point control word is, as far as I can see not changing. The weird thing is the two different results...but only two. the calculation is pretty complex and goes through a lot of iterations. We can see a distinct point at which the two version diverge (so it's not an initial state thing), but can't figure out why. Some more logging perhaps. — Steve, Nov 04 '15 at 21:55
You definitely need to identify the cause for this. So lots of logging should identify it. — David Heffernan, Nov 04 '15 at 22:08
I accepted this answer as yes FP calculations are deterministic. The problem was the input data. — Steve, Nov 06 '15 at 23:46

score 5 · Answer 2 · answered Nov 04 '15 at 20:33

5

You've probably been bitten by combination of optimization and excess x87 FPU precision resulting in the same bit of floating-point code in your source code being duplicated with different assembly code implementations with different rounding behaviour.

The problem with x87 FPU math

The basic problem is that while x87 FPU the supports 32-bit, 64-bit and 80-bit floating-point value, it only has 80-bit registers and the precision of operations is determined by the state of the bits in the floating point control word, not the instruction used. Changing the rounding bits is expensive, so most compilers don't, and so all floating point operations end being be performed at the same precision regardless of the data types involved.

So if the compiler sets the FPU to use 80-bit rounding and you add three 64-bit floating point variables, the code generated will often add the first two variables keeping the unrounded result in a 80-bit FPU register. It would then add the third 64-bit variable to 80-bit value in the register resulting in another unrounded 80-bit value in a FPU register. This can result in a different value being calculated than if the result was rounded to 64-bit precision after each step.

If that resulting value is then stored in a 64-bit floating-point variable then the compiler might write it to memory, rounding it to 64 bits at this point. But if the value is used in later floating point calculations then the compiler might keep it in a register instead. This means what rounding occurs at this point depends on the optimizations the compiler performs. The more its able to keep values in a 80-bit FPU register for speed, the more the result will differ from what you'd get if all floating point operation were rounded according to the size of actual floating point types used in the code.

Why SSE floating-point math is better

With 64-bit code the x87 FPU isn't normally used, instead equivalent scalar SSE instructions are used. With these instructions the precision of the operation used is determined by the instruction used. So with the adding three numbers example, the compiler would emit instructions that added the numbers using 64-bit precision. It doesn't matter if the result gets stored in memory or stays in register, the value remains the same, so optimization doesn't affect the result.

How optimization can turn deterministic FP code into non-deterministic FP code

So far this would explain why you'd get a different result with 32-bit code and 64-bit code, but it doesn't explain why you can get a different result with the same 32-bit code. The problem here is that optimizations can change the your code in surprising ways. One thing the compiler can do is duplicate code for various reasons, and this can cause the same floating point code being executed in different code paths with different optimizations applied.

Since optimization can affect floating point results this can mean the different code paths can give different results even though there's only one code path in the source code. If the code path chosen at run time is non-deterministic then this can cause non-deterministic results even when the in the source code the result isn't dependent on any non-deterministic factor.

An example

So for example, consider this loop. It performs a long running calculation, so every few seconds it prints a message letting the user know how many iterations have been completed so far. At the end of the loop there's simple summation performed using floating-point arithmetic. While there's non-deterministic factor in the loop, the floating-point operation isn't dependent on it. It's always performed regardless of whether progress updated is printed or not.

while ... do
begin
    ...
    if TimerProgress() then
    begin
        PrintProgress(count);
        count := 0
    end
    else
        count := count + 1;
    sum := sum + value
end

As optimization the compiler might move the last summing statement into the end of both blocks of the if statement. This lets both blocks finish by jumping back to the start of the loop, saving a jump instruction. Otherwise one of the blocks has to end with a jump to the summing statement.

This transforms the code into this:

while ... do
begin
    ...
    if TimerProgress() then
    begin
        PrintProgress(count);
        count := 0;
        sum := sum + value
    end
    else
    begin
        count := count + 1;
        sum := sum + value
    end
end

This can result in the two summations being optimized differently. It may be in one code path the variable sum can be kept in a register, but in the other path its forced out in to memory. If x87 floating point instructions are used here this can cause sum to be rounded differently depending on a non-deterministic factor: whether or not its time to print the progress update.

Possible solutions

Whatever the source of your problem, clearing the FPU state isn't going to solve it. The fact that the 64-bit version works, provides an possible solution, using SSE math instead x87 math. I don't know if Delphi supports this, but it's common feature of C compilers. It's very hard and expensive to make x87 based floating-point math conforming to the C standard, so many C compilers support using SSE math instead.

Unfortunately, a quick search of the Internet suggests the Delphi compiler doesn't have option for using SSE floating-point math in 32-bit code. In that case your options would be more limited. You can try disabling optimization, that should prevent the compiler from creating differently optimized versions of the same code. You could also try to changing the rounding precision in the x87 floating-point control word. By default it uses 80-bit precision, but all your floating point variables are 64-bit then changing the FPU to use 64-bit precision should significantly reduce the effect optimization has on rounding.

To do the later you can probably use the Set8087CW procedure MBo mentioned, or maybe System.Math.SetPrecisionMode.

answered Nov 04 '15 at 20:33

Ross Ridge

38,414
7
81
112

It's perfectly possible to get repeatable results on x87. I know that my FEA code achieves that. No need for SSE at all. – David Heffernan Nov 04 '15 at 21:06
@DavidHeffernan Of course it's possible, I never said otherwise. The problem is that it's possible not to get repeatable results on x87 even though there's nothing wrong with your code. – Ross Ridge Nov 04 '15 at 21:23
1

Not so. The same code, with the same data, will produce the same answers. My problem with what you say is that you seem to imply that sse calcs are repeatable, but x87 are not. Which is plain wrong. – David Heffernan Nov 04 '15 at 21:27
@DavidHeffernan I've explained how its possible for the result to change despite the fact that there's nothing wrong with the code. All it takes is a couple of very simple optimizations. – Ross Ridge Nov 04 '15 at 21:39
No. The same code, and I mean compiled code, with the same input, produces the same output. If you want to demonstrate otherwise make a small complete program whose results are not repeatable. – David Heffernan Nov 04 '15 at 22:05
@DavidHeffernan If you want to demonstrate that my answer wrong you need to say something actually refutes something I've actually written. – Ross Ridge Nov 04 '15 at 22:25
2

No. It's the other way around. You claim that the same code can yield different answers when given the same input. You've given no evidence. The IEEE754 standard mandates reproducibility. – David Heffernan Nov 04 '15 at 22:29
FWIW, David Heffernan is correct. Floating-point is not a stochastic tool. It is reproducible to the extent he implies. The point of variation might be uninitialized data, a bug in the kernel's FPR+FPCW save / restore, thermal problems in the processor, someone forgetting to use emms after MMX, and other arcane chaos. Emms leads to NaNs, so seems unlikely. Thermal problems on Intel are rare unless you are overclocking. I see no reason to believe Windows' FPR save/restore is faulty. So, I would go with likelihood of uninitialized data. Find the point of divergence and report back. – Ian Ollmann Nov 04 '15 at 22:38
Please lump under "uninitialized data" other data corruption issues like multithreaded race conditions and array overruns / memory smashers. – Ian Ollmann Nov 04 '15 at 22:54
Essentially you are saying that different code gives different answers, and nobody will argue with that. But we are talking about executing the same code multiple times. And that, with the provisos of same input data and state, results in the same output. – David Heffernan Nov 04 '15 at 23:00
@DavidHeffernan No, I'm saying the same source code can give different answers because, as I explained, the same bit of source code can generate different sequences x87 FPU instructions in the same executable that result in a differently rounded result. Which sequence of instructions in the executable gets executed can end up depending on some non-deterministic factor despite the fact the source code lines that generated these instruction sequences aren't dependent on the non-deterministic factor. I've explained in detail how this can happen in my answer. – Ross Ridge Nov 04 '15 at 23:30
The same source code generating different sequences of FPU instructions is not the problem the OP has described. OP said that the values change in the executable while it is running. – Graymatter Nov 05 '15 at 06:09
@Graymatter My answer shows how the value can change while its running because the sequence of FPU instructions being executed changes while its running. – Ross Ridge Nov 05 '15 at 06:18
Are you talking about branch prediction fail? (http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array). That wouldn't change the result. What I get from your answer is that the compiler can produce different code depending on the optimizations. I can't see how the same set of instructions can produce a different result if executed multiple times. That would be a huge flaw in a processor if the same code didn't produce the same result every time using the same data. – Graymatter Nov 05 '15 at 06:31
@Graymatter No, I'm not talking about branch prediction. My answer shows how a compiler can generate two different code sequences for the same source code statement, it shows how these two code sequences can generate different results for the same data because of rounding differences, and it shows how which code sequence gets executed can change depending a non-deterministic factor. The value doesn't change because the CPU executes the instructions differently, but because the CPU executes different instructions. – Ross Ridge Nov 05 '15 at 06:49
Your latest string of comments confirm what I said. You are describing a scenario where different code produces different results. The processor does not execute source code. It executes machine code. Nobody disputes the fact that different machine instructions give different results. But the asker says that the same code gives different results. You have missed the point completely. – David Heffernan Nov 05 '15 at 07:01
But that goes back to what I said before. It's the compiler producing different sets of instructions. That doesn't match what the OP described where the same code is producing two different results. – Graymatter Nov 05 '15 at 07:03
@Graymatter The compiler can generate two different sequences of CPU instructions for the same line of source code in the same executable. Which sequence of instructions in the same executable that gets executed can be dependent on a non-deterministic factor that can change during the running of the program. – Ross Ridge Nov 05 '15 at 07:21
Of course if the same code is repeated or a method is marked as inline then the compiler can produce different machine code in the two different places. That's not an issue here as the OP has indicated that the same code is producing different results. – Graymatter Nov 05 '15 at 07:32
@Graymatter All we know about the original poster's code is that "the calculation is very complicated". We don't know that calculation doesn't have inlined methods, or something else that might cause the compiler to generate different instruction sequences for the same bit of source code. We don't know that it doesn't contain something like the example I gave. – Ross Ridge Nov 05 '15 at 07:51
Inlined methods explain nothing. If the same function is called, with the same input data, then the same inlined code will be executed. You've written a long answer stating the self-evident fact that different code produces different output. If that's what you want to say, you should perhaps state it more directly. Certainly intimating that x87 is worse than SSE is misleading at the very best. – David Heffernan Nov 05 '15 at 09:03
Anyway, there's little point in continuing the discussion. We all know where we stand now. You believe that the problem is caused by the program switching to a different path of execution. That's certainly plausible even if the question indicates otherwise. There's no need for you to introduce all the details that you do. You can simply state that different code produces different output, remember of course that the machine executes machine code rather than source code. – David Heffernan Nov 05 '15 at 09:05