
Similar to the SO question What does gcc's ffast-math actually do? and related to the SO question Clang optimization levels, I'm wondering what clang's -Ofast optimization does in practical terms, whether it differs at all from gcc's, and whether the differences are more hardware dependent than compiler dependent.

According to the accepted answer for clang's optimization levels, -Ofast adds the following to the -O3 optimizations: -fno-signed-zeros -freciprocal-math -ffp-contract=fast -menable-unsafe-fp-math -menable-no-nans -menable-no-infs. These all seem to be floating point math related. But what do these optimizations mean in practical terms for things like the C++ Common mathematical functions for floating point numbers on a CPU like an Intel Core i7, and how reliable are these differences?

For example, in practical terms:

The code std::isnan(std::numeric_limits<float>::infinity() * 0) returns true for me with -O3. I believe this is the expected IEEE-compliant result.

With -Ofast, however, I get a false return value. Additionally, the expression (std::numeric_limits<float>::infinity() * 0) == 0.0f evaluates to true.
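
A minimal program reproducing this (a sketch; build once with -O3 and once with -Ofast and compare the output):

#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
    const float x = std::numeric_limits<float>::infinity() * 0;
    std::printf("isnan: %d, equals 0.0f: %d\n",
                static_cast<int>(std::isnan(x)),
                static_cast<int>(x == 0.0f));
}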

I don't know whether this is the same as what's seen with gcc. It's not clear to me how architecture dependent the results are, nor how compiler dependent they are, nor whether any standard still applies to floating point math under -Ofast.

If anyone has produced something like a set of unit tests or code koans that answers this, that would be ideal. I've started to do something like this but would rather not reinvent the wheel.

Louis Langholtz
  • It's basically `-O3 -ffast-math`. Also, compile-time-constants work differently from the behaviour for runtime-variable values with `-ffast-math`. An FP multiply will still compile to something like `mulss` with `-ffast-math`, and the hardware will still produce NaN in the `inf * 0.0` case. What you're seeing is a compile-time optimization of `anything * 0.0 => 0.0` – Peter Cordes Aug 18 '17 at 23:43
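
To illustrate the distinction drawn in this comment, a minimal sketch (assuming clang on x86-64; the harness is mine): with both operands runtime-variable, the multiply compiles to a real mulss and the hardware still produces NaN even under -Ofast, whereas the constant-folded case in the question yields 0.

#include <cstdio>
#include <limits>

int main()
{
    // Both operands are volatile, so neither is a visible constant and
    // the compiler must emit an actual multiply instruction.
    volatile float inf  = std::numeric_limits<float>::infinity();
    volatile float zero = 0.0f;
    std::printf("%f\n", static_cast<double>(inf * zero)); // prints nan even with -Ofast
}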

1 Answer


Describing how each of these flags affects each of the math functions would require far too much work, so I'll give an example for each flag instead,
leaving to you the burden of working out how each one could affect a given function.


-fno-signed-zeros

Assumes that your code doesn't depend on the sign of zero.
In FP arithmetic, zero is not an absorbing element with respect to multiplication: 0 · x = x · 0 ≠ 0 in general, because zero is signed and thus, for example, -3 · 0 = -0 ≠ 0 (where a bare 0 usually denotes +0).

You can see this live on Godbolt, where a multiplication by zero is folded to a constant zero only with -Ofast:

float f(float a)
{
    return a*0;
}

;With -Ofast
f(float):                                  # @f(float)
        xorps   xmm0, xmm0
        ret

;With -O3
f(float): # @f(float)
  xorps xmm1, xmm1
  mulss xmm0, xmm1
  ret

As EOF noted in the comments, this also depends on finite arithmetic.
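
A small demo of why the sign of zero is observable (a sketch of mine; the comments assume plain -O3, where IEEE folding applies):

#include <cmath>
#include <cstdio>

int main()
{
    const float nz = -3.0f * 0.0f;  // -0.0f under IEEE rules
    std::printf("signbit: %d\n", static_cast<int>(std::signbit(nz))); // 1 with -O3
    std::printf("1/nz: %f\n", 1.0f / nz); // -inf with -O3; -Ofast may fold nz to +0
}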

-freciprocal-math

Use reciprocals instead of divisions: a/b = a · (1/b).
Due to the limited precision of FP numbers, the equality doesn't actually hold.
Multiplication is faster than division; see Agner Fog's instruction tables.
See also Why is -freciprocal-math unsafe in GCC?.

Live example on Godbolt:

float f(float a){
    return a/3;
}

;With -Ofast
.LCPI0_0:
        .long   1051372203              # float 0.333333343
f(float):                                  # @f(float)
        mulss   xmm0, dword ptr [rip + .LCPI0_0]
        ret

;With -O3
.LCPI0_0:
  .long 1077936128 # float 3
f(float): # @f(float)
  divss xmm0, dword ptr [rip + .LCPI0_0]
  ret
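
To see numerically why the rewrite is unsafe, here is a sketch (values chosen by me): a/b is a single correctly rounded operation, while a · (1/b) rounds twice, so the results can differ in the last bit.

#include <cstdio>

int main()
{
    volatile float a = 10.0f, b = 3.0f; // volatile: keep the operations at runtime
    const float q1 = a / b;             // one rounding
    const float q2 = a * (1.0f / b);    // two roundings
    std::printf("%.9g vs %.9g (equal: %d)\n", q1, q2,
                static_cast<int>(q1 == q2)); // not equal with -O3
}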

-ffp-contract=fast

Enables contraction of FP expressions.
Contraction is used here as an umbrella term for any law valid in the field ℝ that results in a simplified expression; the classic case is fusing a multiply and an add into a single fused multiply-add (FMA).
For example, a · k / k = a.

However, due to finite precision, the set of FP numbers equipped with + and · is not a field in general.
This flag allows the compiler to contract FP expressions at the cost of strict correctness.

Live example on Godbolt:

float f(float a){
    return a/3*3;
}

;With -Ofast 
f(float):                                  # @f(float)
        ret

;With -O3
.LCPI0_0:
  .long 1077936128 # float 3
f(float): # @f(float)
  movss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
  divss xmm0, xmm1
  mulss xmm0, xmm1
  ret
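
For the FMA case, a sketch (my own values; the product is split across two statements because newer clang may already contract within a single expression at -O3):

#include <cmath>
#include <cstdio>

int main()
{
    volatile float a = 1.0f + 1.0f / 4096; // 1 + 2^-12, so a*a needs 25 significand bits
    const float c = -1.0f;
    const float p = a * a;                 // rounded product
    const float separate = p + c;          // second rounding
    const float fused = std::fma(a, a, c); // single rounding, as contraction would do
    std::printf("%.9g vs %.9g\n", separate, fused);
}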

-menable-unsafe-fp-math

Like the above, but broader in scope.

Enable optimizations that make unsafe assumptions about IEEE math (e.g. that addition is associative) or may not work for all input ranges. These optimizations allow the code generator to make use of some instructions which would otherwise not be usable (such as fsin on X86).

See this discussion of the fsin instruction's accuracy.

Live example on Godbolt, where a⁴ is expanded into (a²)²:

float f(float a){
    return a*a*a*a;
}

;With -Ofast
f(float):                                  # @f(float)
        mulss   xmm0, xmm0
        mulss   xmm0, xmm0
        ret

;With -O3
f(float): # @f(float)
  movaps xmm1, xmm0
  mulss xmm1, xmm1
  mulss xmm1, xmm0
  mulss xmm1, xmm0
  movaps xmm0, xmm1
  ret
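
Reassociation is the classic example of such an unsafe assumption: FP addition is not associative, so summing the same terms in a different order changes the result. A sketch (values mine):

#include <cstdio>

int main()
{
    volatile float big = 1.0e8f, small = 1.0f;
    const float left  = (big + small) - big; // 0 with -O3: small is absorbed into big
    const float right = small + (big - big); // 1 with -O3
    // With -Ofast the compiler may reassociate both into `small`, printing 1 twice.
    std::printf("%g vs %g\n", left, right);
}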

-menable-no-nans

Assumes the code generates no NaN values.
In a previous answer of mine I analysed how ICC dealt with complex number multiplication by assuming no NaNs.

Most FP instructions deal with NaNs automatically.
There are exceptions though, such as comparisons; this can be seen live on Godbolt:

bool f(float a, float b){
    return a<b;
}

;With -Ofast
f(float, float):                                 # @f(float, float)
        ucomiss xmm0, xmm1
        setb    al
        ret

;With -O3
f(float, float): # @f(float, float)
  ucomiss xmm1, xmm0
  seta al
  ret

Note that the two versions are not equivalent: the -O3 one excludes the case where a and b are unordered from the true result, while the -Ofast one includes it.
While the performance is the same in this case, in complex expressions this asymmetry can lead to different unfoldings/optimisations.
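
A sketch of the underlying semantics (names mine): with a NaN operand, a < b and !(b <= a) are different predicates, which is exactly the distinction -menable-no-nans lets the compiler ignore.

#include <cstdio>

int main()
{
    volatile float z = 0.0f;
    const float nan = z / z; // quiet NaN, produced at runtime
    std::printf("nan < 1:     %d\n", static_cast<int>(nan < 1.0f));     // 0: unordered compares false
    std::printf("!(1 <= nan): %d\n", static_cast<int>(!(1.0f <= nan))); // 1: negation flips it
}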

-menable-no-infs

Just like the above but for infinities.

I was unable to reproduce a simple example on Godbolt, but the trigonometric functions need to deal with infinities carefully, especially for complex numbers.

If you browse a glibc implementation's math directory (e.g. sinc) you'll see a lot of checks that would be omitted when compiling with -Ofast.
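
As an illustration of the kind of check that can disappear, a sketch (function name mine): under -Ofast the compiler may fold std::isinf(x) to false and drop the clamping branch entirely.

#include <cmath>
#include <cstdio>
#include <limits>

float safe_scale(float x)
{
    if (std::isinf(x))                             // may be folded away by -Ofast
        return std::numeric_limits<float>::max();  // clamp infinities
    return x * 0.5f;
}

int main()
{
    volatile float inf = std::numeric_limits<float>::infinity();
    std::printf("%g\n", safe_scale(inf)); // 3.40282e+38 with -O3, inf with -Ofast
}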

Margaret Bloom
  • related to `-menable-unsafe-fp-math`: [Why doesn't GCC optimize `a*a*a*a*a*a` to `(a*a*a)*(a*a*a)`?](https://stackoverflow.com/q/6430448/995714) – phuclv Aug 15 '17 at 08:47
  • This answer helps. Thank you! Are there any differences in the behaviors between what clang does for these flags compared with gcc? Seems `-Ofast` says it's okay to break strict IEEE compliance but it's not clear whether that means there's any standard left for floating point math to still adhere to and whether what's done is compiler dependent or purely hardware dependent. – Louis Langholtz Aug 15 '17 at 15:33
  • @LưuVĩnhPhúc Thank you for identifying the question that you cited. I don't recall having looked at that one before and the answers there are also insightful for me. Are there differences though between the results from gcc and the results from clang? I haven't finished reading all the answers and comments yet so maybe someone already answers that. Also, I just found [IEEE 754 floating-point test software](http://www.math.utah.edu/~beebe/software/ieee/) which seems to have relevant information. – Louis Langholtz Aug 15 '17 at 15:53
  • In the first point, `-fno-signed-zeros`, your example also depends on `-ffinite-math-only`, I believe. Correctly signed zero for `a * 0` could be produced by `copysign(0, a)`, which could be implemented with an `AND` instruction for finite, non-NaN `a`. On the other hand, producing constant `0` is incorrect for `0 * INF` and `0 * NAN` regardless of whether zeroes are signed. – EOF Aug 15 '17 at 16:02