
I have a Fortran program that gives different results with -O0 and -O1 on 32-bit systems. Tracking down the difference, I came up with the following test case (test.f90):

program test
  implicit none
  character foo
  real*8 :: Fact, Final, Zeta, rKappa, Rnxyz, Zeta2

  read(5,*) rKappa
  read(5,*) Zeta
  backspace(5)
  read(5,*) Zeta2   ! re-reads the same line, so Zeta2 == Zeta
  read(5,*) Rnxyz

  Fact = rKappa/Sqrt(Zeta**3)
  write(6,'(ES50.40)') Fact*Rnxyz   ! expression printed directly

  Fact = rKappa/Sqrt(Zeta2**3)
  Final = Fact*Rnxyz                ! result stored in a variable, then printed
  write(6,'(ES50.40)') Final

end program test

with this data file:

4.1838698196228139E-013
20.148674000000000     
-0.15444754236171612

The program should write exactly the same number twice. Note that Zeta2 is the same as Zeta, since the same number is read again (this is to prevent the compiler from realizing they are the same number and hiding the problem). The only difference is that in the first case the operation is done "on the fly" inside the write statement, while in the second the result is saved in a variable and the variable is printed.

Now I compile with gfortran 4.8.4 (Ubuntu 14.04 version) and run it:

$ gfortran -O0 -m32 test.f90 && ./a.out < data
   -7.1447898573566615177997578153994664188136E-16
   -7.1447898573566615177997578153994664188136E-16

$ gfortran -O1 -m32 test.f90 && ./a.out < data
   -7.1447898573566615177997578153994664188136E-16
   -7.1447898573566605317236262891347096541529E-16

So, with -O0 the numbers are identical, with -O1 they are not.

I tried checking the optimized code with -fdump-tree-optimized:

  final.10_53 = fact_44 * rnxyz.9_52;
  D.1835 = final.10_53;
  _gfortran_transfer_real_write (&dt_parm.5, &D.1835, 8);
  [...]
  final.10_63 = rnxyz.9_52 * fact_62;
  final = final.10_63;
  [...]
  _gfortran_transfer_real_write (&dt_parm.6, &final, 8);

The only difference I see is that in one case the number printed is fact*rnxyz, and in the other it is rnxyz*fact. Can this change the result? From High Performance Mark's answer, I guess that might have to do with which variable goes to which register when. I also tried looking at the assembly output generated with -S, but I can't say I understand it.

And then, without the -m32 flag (on a 64-bit machine), the numbers are also identical...

Edit: The numbers are identical if I add -ffloat-store or -mfpmath=sse -msse2 (see here, at the end). This makes sense, I guess, when I compile on an i686 machine, as the compiler would by default use 387 math. But when I compile on an x86-64 machine, with -m32, it shouldn't be needed according to the documentation:

-mfpmath=sse [...]

For the i386 compiler, you must use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.

[...]

This is the default choice for the x86-64 compiler.

Maybe -m32 makes these "defaults" ineffective? However, running gfortran -Q --help=target says mfpmath is 387 and msse2 is disabled...

Jellby
  • Check the assembly generated code. O1 may apply some optimizations to the code you posted. You may also be interested in http://stackoverflow.com/questions/19618679/floating-point-optimizations-guideline. – Harald Oct 09 '16 at 09:27
  • @Harald I checked the output with `-S`, but that starts to be too obscure... In any case, I can't see any difference there either. I'm updating the question with the (AFAICT) relevant part. – Jellby Oct 09 '16 at 09:35
  • @Harald I had made a mistake in the tests, so the diagnostic is a bit different. It seems there's a change in the order of factors. – Jellby Oct 09 '16 at 10:19

1 Answer


Too long for a comment, but more of a suspicion than an answer. OP writes

The only difference is that first an operation is done "on the fly" when writing, and then the result is saved in a variable and the variable is printed.

which has me thinking about the x86 architecture's internal 80-bit (x87) floating-point arithmetic, which is what 32-bit code uses by default. The precise results of a sequence of f-p arithmetic operations are affected by when intermediate values are trimmed from 80 to 64 bits. And that is exactly the kind of thing which may differ from one compiler optimisation level to another.

Note too that the differences between the two numbers printed by the O1 version of the code kick in around the 16th decimal digit, right at the limit of precision available in 64-bit f-p arithmetic.

Some more fiddling around gives

1 01111001100 1001101111011110011111001110101101101100011000001110

as the IEEE-754 representation of

-7.1447898573566615177997578153994664188136E-16

and

1 01111001100 1001101111011110011111001110101101101100011000001101

as the IEEE-754 representation of

-7.1447898573566605317236262891347096541529E-16

The two numbers differ by 1 in the last bit of their significands. It's possible that at O0 your compiler adheres to IEEE-754 rules for f-p arithmetic (those rules are strict about matters such as rounding at the low-order bits) but at O1 adheres only to Fortran's rather more relaxed view of arithmetic. (The Fortran standard does not require the use of IEEE-754 arithmetic.)

You may find a compiler option to enforce adherence to IEEE-754 rules at higher levels of optimisation. You may also find that that adherence costs you a measurable amount of run time.

High Performance Mark
  • Unfortunately, in the actual application this mismatch grows to more than 1e-08. Maybe the algorithm and test case are too sensitive, or maybe there are some other reasons for the final discrepancy besides this one. (Note I've modified the conclusion in the question, there's actually a difference in the order of factors.) – Jellby Oct 09 '16 at 10:22
  • @Jellby And does the difference have some real significance, or are you just predicting the weather for next Christmas and getting two different answers? – Vladimir F Героям слава Oct 09 '16 at 12:22
  • @VladimirF It is significant to the extent that it makes a test suite fail on a 32-bit machine with `-O0`, because it accepts differences of up to 1e-08. But given the different results I get with the above sample, I would expect the test to fail with `-O1` instead, so maybe the problem is something different. – Jellby Oct 09 '16 at 12:36
  • Sure, but is the limit in the test suite relevant? – Vladimir F Героям слава Oct 09 '16 at 13:06
  • As preceding comments indicated, you appear to have selected 387 code generation. At O1 some of the data-narrowing roundoff steps are skipped. If you want consistent avoidance of extra precision you could set e.g. -march=native. If you want precision consistently according to your 387 precision mode setting you would declare real(10) or real*10. – tim18 Oct 09 '16 at 13:10
  • @tim18 I hadn't willingly selected 387 code. `-march=native` didn't help, but see the updated question about `-mfpmath=sse -msse2`. – Jellby Oct 09 '16 at 18:07