I have a really strange error that I've spent several days trying to figure out, so now I want to see if anybody has any comments to help me understand what's happening.
Some background. I'm working on a software project which involves adding C++ extensions to Python 2.7.1 using Boost 1.45, so all my code is being run through the Python interpreter. Recently, I made a change to the code which broke one of our regression tests. This regression test is probably too sensitive to numerical fluctuations (e.g. different machines), so I should fix that. However, since this regression is breaking on the same machine/compiler that produced the original regression results, I traced the difference in results to this snippet of numerical code (which is verifiably unrelated to the code I changed):
c[3] = 0.25 * (-3 * df[i-1] - 23 * df[i] - 13 * df[i+1] - df[i+2]
- 12 * f[i-1] - 12 * f[i] + 20 * f[i+1] + 4 * f[i+2]);
printf("%2li %23a : %23a %23a %23a %23a : %23a %23a %23a %23a\n",i,
c[3],
df[i-1],df[i],df[i+1],df[i+2],f[i-1],f[i],f[i+1],f[i+2]);
which constructs some numerical tables. Note that:
- %a prints an exact ASCII (hexadecimal) representation of each double
- The left hand side (lhs) is c[3], and the rhs is the other 8 values.
- The output below was for values of i that were far from the boundaries of f, df
- This code exists within a loop over i, which is itself nested several layers deep (so I'm unable to provide an isolated case to reproduce this).
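As a side note on why %a is a reasonable basis for comparing outputs, here is a minimal sketch of a round trip through %a and strtod. The helper names (to_hex_float, from_hex_float, round_trips_exactly) are made up for illustration and are not part of my project; the sketch assumes a C99/C++11 library where strtod accepts hexadecimal floats.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: format a double with %a (hexadecimal floating point),
// the same conversion used by the printf call above.
std::string to_hex_float(double x) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%a", x);
    return std::string(buf);
}

// Hypothetical helper: parse a %a string back into a double.
// strtod accepts hexadecimal floats since C99 / C++11.
double from_hex_float(const std::string& s) {
    return std::strtod(s.c_str(), nullptr);
}

// True if formatting and re-parsing reproduces the identical bit pattern.
bool round_trips_exactly(double x) {
    double y = from_hex_float(to_hex_float(x));
    return std::memcmp(&x, &y, sizeof x) == 0;
}

// Small self-check a caller could run once.
void self_check() {
    assert(round_trips_exactly(0.1));
}
```

Because the round trip is bit-exact, two lines of %a output that differ in any digit really do correspond to different doubles, and identical lines correspond to identical doubles.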
So I cloned my source tree, and the only difference between the two executables I compile is that the clone includes some extra code which isn't even executed in this test. This makes me suspect that it must be a memory problem, since the only difference should be where the code exists in memory... Anyway, when I run the two executables, here's the difference in what they produce:
diff new.out old.out
655,656c655,656
< 6 -0x1.7c2a5a75fc046p-10 : 0x0p+0 0x0p+0 0x0p+0 -0x1.75eee7aa9b8ddp-7 : 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.1eaea08b55205p-4
< 7 -0x1.a18f0b3a3eb8p-10 : 0x0p+0 0x0p+0 -0x1.75eee7aa9b8ddp-7 -0x1.a4acc49fef001p-6 : 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.1eaea08b55205p-4 0x1.9f6a9bc4559cdp-5
---
> 6 -0x1.7c2a5a75fc006p-10 : 0x0p+0 0x0p+0 0x0p+0 -0x1.75eee7aa9b8ddp-7 : 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.1eaea08b55205p-4
> 7 -0x1.a18f0b3a3ec5cp-10 : 0x0p+0 0x0p+0 -0x1.75eee7aa9b8ddp-7 -0x1.a4acc49fef001p-6 : 0x1.304ec13281eccp-4 0x1.304ec13281eccp-4 0x1.1eaea08b55205p-4 0x1.9f6a9bc4559cdp-5
<more output truncated>
You can see that the value in c[3] is subtly different, while none of the rhs values are different. So somehow identical input is giving rise to different output. I tried simplifying the rhs expression, but any change I make eliminates the difference. If I print &c[3], the difference goes away. If I run on the two other machines I have access to (Linux, OS X), there's no difference. Here's what I've already tried:
- valgrind (reported numerous problems in python, but nothing in my code, and nothing that looked serious)
- -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_ASSERT -D_GLIBCXX_DEBUG_PEDASSERT -D_GLIBCXX_DEBUG_VERIFY (but nothing asserts)
- -fno-strict-aliasing (but I do get aliasing compile warnings out of the boost code)
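To put a number on "subtly different" in the diff above, the gap between two doubles can be measured in units in the last place (ULPs). This is only a sketch: the function names are hypothetical, and it assumes both values are finite and have the same sign, which holds for the c[3] values shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Reinterpret the bits of a double as a signed 64-bit integer.
std::int64_t bits_of(double x) {
    std::int64_t b;
    std::memcpy(&b, &x, sizeof b);
    return b;
}

// Distance in ULPs between two finite doubles of the same sign.
// For same-sign IEEE 754 doubles, adjacent values have adjacent bit patterns.
std::int64_t ulp_distance(double a, double b) {
    assert(std::signbit(a) == std::signbit(b));
    std::int64_t d = bits_of(a) - bits_of(b);
    return d < 0 ? -d : d;
}

// Convenience wrapper taking the %a strings straight from the program output.
std::int64_t ulp_distance_hex(const char* a, const char* b) {
    return ulp_distance(std::strtod(a, nullptr), std::strtod(b, nullptr));
}
```

Applied to the two diff lines, ulp_distance_hex("-0x1.7c2a5a75fc046p-10", "-0x1.7c2a5a75fc006p-10") gives 64 and ulp_distance_hex("-0x1.a18f0b3a3eb8p-10", "-0x1.a18f0b3a3ec5cp-10") gives 220, so the results differ in roughly the last two hex digits of the mantissa.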
I tried switching from gcc 4.1.2 to gcc 4.5.2 on the machine that has the problem, and this specific, isolated difference goes away (but the regression still fails, so let's assume that's a different problem).
Is there anything I can do to isolate the problem further? For future reference, is there any way to analyze or understand this kind of problem quicker? For example, given my description of lhs changing even though rhs is not, what would you conclude?
EDIT:
The problem was entirely due to -ffast-math.
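For anyone hitting the same thing: I did not pin down which -ffast-math transformation was responsible, but one plausible mechanism is that the flag lets the compiler regroup floating-point sums, and floating-point addition is not associative. The sketch below illustrates only that general effect with made-up inputs, not my actual expression; sum_left and sum_right are hypothetical names, and the volatile temporaries are there just to force each intermediate to be rounded to a double.

```cpp
#include <cassert>

// Sum three doubles grouped as (a + b) + c.
double sum_left(double a, double b, double c) {
    volatile double t = a + b;  // volatile forces rounding to double here
    return t + c;
}

// Same operands, grouped the other way: a + (b + c).
double sum_right(double a, double b, double c) {
    volatile double t = b + c;  // volatile forces rounding to double here
    return a + t;
}

// Illustration with made-up inputs: the grouping changes the result.
void demo() {
    assert(sum_left(1e16, -1e16, 1.0) != sum_right(1e16, -1e16, 1.0));
}
```

With a = 1e16, b = -1e16, c = 1.0, the first grouping returns 1.0 and the second returns 0.0, because -1e16 + 1.0 is not representable and rounds back to -1e16. If the optimizer is free to choose the grouping, unrelated changes elsewhere in the translation unit can change which one it picks, which would be consistent with identical inputs producing different c[3] values.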