3

After a discussion with colleagues, I ended up testing whether clang would optimize two divisions, with a reciprocal of 1, into a single division.

    const float x = a / b; // x not used elsewhere
    const float y = 1 / x;

Theoretically, clang could optimize this to `const float y = b / a;` if x is used only as a temporary intermediate value, no?

Here's the input and output of a simple test case: https://gist.github.com/Jiboo/d6e839084841d39e5ab6 (in both outputs you can see that it performs the two divisions instead of optimizing).

This related question is beyond my comprehension and seems to focus only on why a specific instruction isn't used, whereas in my case it's the optimisation itself that isn't done: Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math

Thanks, JB.

Jiboo
  • There are lots of "why doesn't compiler X perform optimization Y" questions on SO, and the answer is often that it comes down to a cost/benefit analysis. Each optimization takes time to implement and adds more code that has to be maintained, and the performance increase may not be worthwhile. Do you really want to make your compiler more complicated to save a single division instruction? There will probably always be optimizations that the compiler doesn't spot -- is this one high on the list of cases that would help a lot? – Caleb Sep 21 '15 at 14:13
  • 3
    In this case it's not cost-benefit analysis. The compiler can and probably does make the optimization with options (e.g. `-ffast-math`) which tell it that it's okay to make incorrect transformations that are "approximately correct" if they improve performance. It can't make this transformation by default however because it changes the observable behavior of the program. – R.. GitHub STOP HELPING ICE Sep 21 '15 at 15:01
  • @R.. I searched for that kind of flag after I saw it wasn't optimized; `-ffast-math` keeps the 1/x, and I didn't find one that would do the trick. Added the output to the gist. (Don't look for it, I just thought -O3 already triggered such behaviors.) – Jiboo Sep 21 '15 at 15:20
  • 1
    With gcc, `-freciprocal-math` does it. Oddly `-ffast-math` does not enable the former. (See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) Maybe the same is true for clang? – R.. GitHub STOP HELPING ICE Sep 21 '15 at 15:30
  • 1
    @R..: actually, `-freciprocal-math` is part of `-funsafe-math-optimizations`, which **is** part of `-ffast-math`. The option is for doing `1/y` once, then multiplying by it inside a loop. (Or otherwise pulling `1/y` out with common-subexpression-elimination (CSE).) I can't get gcc or clang to turn the OP's code into a single divss, with any combination of options I've tried. https://goo.gl/yaLuYh (godbolt, with the code in a function `float recip_test(float, float)` for easy testing) – Peter Cordes Sep 21 '15 at 21:29
  • 1
    Gcc has a `-mrecip` option to use the hardware fast-approximate-reciprocal instruction, with one Newton-Raphson iteration. Even using that, it does the whole process twice: once to get 1/b and multiply that with a, then again to get 1/(a/b). So @Caleb: Yes, people absolutely do want to make compilers more complicated to save a single division instruction (where it's possible). Division is one of the slowest things, and the division unit usually isn't even fully pipelined. Even Skylake's can't do one FP div per clock. I'm surprised gcc and clang don't do this already with `-ffast-math`. – Peter Cordes Sep 21 '15 at 21:33

2 Answers

4

No, clang cannot do that.

But first, why are you using float? float has six digits of precision, double has 15. Unless you have a good reason that you can explain, use double.

`1 / (a / b)` in floating-point arithmetic is not the same as `b / a`. In the first case, the compiler has to:

  1. Divide a by b
  2. Round the result to the nearest floating-point number
  3. Divide 1 by the result
  4. Round the result to the nearest floating-point number.

In the second case:

  1. Divide b by a.
  2. Round the result to the nearest floating-point number.

The compiler can only change the code if the result is guaranteed to be the same, and if the compiler writer cannot produce a mathematical proof that the result is the same, the compiler cannot change the code. There are two rounding operations in the first case, rounding different numbers, so it is unlikely that the result can be guaranteed to be the same.

gnasher729
1

The compiler doesn't think like a mathematician. Where you think simplifying the expression is mathematically trivial, the compiler has a lot of other things to consider. It is actually quite likely that the compiler is much smarter than the programmer and also knows far more about the C standard.

Something like this is probably what goes through the optimizing compiler's "mind":

  • Ah, they wrote a / b but only use x in one place, so we don't have to allocate that variable on the stack. I'll remove it and use a CPU register.
  • Hmm, an integer literal 1 divided by a float variable. Okay, we have to apply the usual arithmetic conversions here before anything else and turn that literal into a float 1.0f.
  • The programmer is counting on me to generate code that contains the potential floating point inaccuracy involved in dividing 1.0f by another float variable! So I can't just swap this expression with b / a, because then the floating point inaccuracy that the programmer seems to want here would be lost.

And so on. There are a lot of considerations. What machine code you end up with is hard to predict in advance. Just know that the compiler follows your instructions to the letter.

Lundin