If this is how you are doing optimizations, you are doing it wrong. I am sorry, I don't usually use negative terms when explaining something, but you are on the wrong path.
What you are doing is called premature optimization and micro-optimization. You are trying to optimize something you don't even know needs optimizing. First, and this is a deal-breaker: enable compiler optimizations. Then you would normally profile and optimize the hot spots.
Let's see how relevant the optimization you are trying to do is.
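For reference, the pair of functions being compared looks something like this (the exact source isn't quoted here, so treat this as my reconstruction; the int return type matches the cvttsd2si truncation in the output below):

// Hypothetical reconstruction of the two functions under test.
int mul_inverse3(double x) { return x * (1 / 3.0); }  // multiply by reciprocal
int div3(double x)         { return x / 3.0; }        // plain division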
First, I compile with -O3 (gcc 6.1). Let's look at the output (also on the Godbolt compiler explorer):
mul_inverse3(double):
mulsd .LC0(%rip), %xmm0
cvttsd2si %xmm0, %eax
ret
div3(double):
divsd .LC1(%rip), %xmm0
cvttsd2si %xmm0, %eax
ret
where .LC0 and .LC1 are the constants: the nearest double to 1/3.0, and 3.0:
.LC0:
.long 1431655765
.long 1070945621
.LC1:
.long 0
.long 1074266112
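If you want to double-check that those bit patterns really are the nearest double to 1/3.0 and exactly 3.0, here is a quick sketch (my own addition, assuming the little-endian x86 layout where the first .long is the low half of the double):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* Reassemble each constant: high .long in the top 32 bits, low .long in the bottom. */
    uint64_t lc0 = ((uint64_t)1070945621 << 32) | 1431655765u;
    uint64_t lc1 = ((uint64_t)1074266112 << 32) | 0u;
    double d0, d1;
    memcpy(&d0, &lc0, sizeof d0);  /* well-defined way to type-pun bits to double */
    memcpy(&d1, &lc1, sizeof d1);
    printf("%.17g %.17g\n", d0, d1);  /* prints 0.33333333333333331 3 */
    return 0;
}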
OK, so you already see how much the compiler can do on its own. This is what you should have been looking at, not the noisy -O0 output.
Which one is faster? As a rule of thumb, multiplication is faster than division on all CPUs; the exact ratio depends on the specific microarchitecture. For x86, see Agner Fog's instruction tables and microarch pdf, and the other perf links in the x86 tag wiki. For example, division has ~3x the latency and ~1/12th the throughput of multiplication on Intel Sandybridge. And still, div might not even be the bottleneck, so div vs. mul might not affect the total performance at all: cache misses, other pipeline stalls, or even things like IO could hide the difference entirely.
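If you insist on seeing the mul/div gap in isolation, a crude timing loop like this sketch (entirely my own illustration) can show it; note that it measures only this loop on your machine, not your real program, which is exactly the point above:

#include <stdio.h>
#include <time.h>

int main(void) {
    volatile double sink = 0.0;  /* volatile so the loops aren't optimized away */
    const double inv3 = 1 / 3.0;
    clock_t t0 = clock();
    for (long i = 1; i <= 100000000L; ++i)
        sink = sink + (double)i * inv3;  /* multiply version */
    clock_t t1 = clock();
    for (long i = 1; i <= 100000000L; ++i)
        sink = sink + (double)i / 3.0;   /* divide version */
    clock_t t2 = clock();
    printf("mul: %fs div: %fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}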
But the two instruction sequences above are still different. Can we obtain the same code from the compiler? Yes! Add -ffast-math. With this option, the compiler is allowed to rearrange/change floating-point operations even if the result changes slightly (which is exactly what you are trying to do by hand). Be careful, though, as this option applies to the whole program.
mul_inverse3(double):
mulsd .LC0(%rip), %xmm0
cvttsd2si %xmm0, %eax
ret
div3(double):
mulsd .LC0(%rip), %xmm0
cvttsd2si %xmm0, %eax
ret
Can we do more? Yes: add -march=native, and search the compiler documentation for more switches.
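Putting it together, a command line like this (the file name is just an example) lets you inspect the generated assembly yourself:

gcc -O3 -ffast-math -march=native -S test.c   # assembly ends up in test.s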
So that was the first lesson: Let the compiler do its optimizations first!
Here comes the second:
You spend 1 week trying to optimize a random operation. You finally do it! After a week of hard work and sleepless nights you make that operation 10 times faster (wow! congratulations). Then you run your program and see that the whole program is only 1% faster. The horror! How can this be possible? Well, if your program spends only 1% of its time doing that operation, then... you get the point. There can also be other mechanisms underneath, like OS optimizations, the CPU pipeline, etc., that make an operation repeated 100 times consume much, much less than 100 times the cost. What you need to do is profile first! Find the hot loops and optimize those! Optimize the loop/function where the program spends 60% of its time.
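On Linux, one common way to do that looks roughly like this (gprof, VTune, etc. work too; the program name is just a placeholder):

gcc -O3 -g myprog.c -o myprog   # keep debug info so the profiler can name functions
perf record ./myprog            # sample where the time goes
perf report                     # see which functions dominate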
More importantly, look for high level optimizations so your code can do less work, instead of just doing the same work faster.
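For example (a sketch of my own to illustrate the idea): computing a mean by dividing every element does n divisions, while summing first and dividing once does the same job with a single division. No amount of tuning the first version beats simply not doing the work. (Note the two can round slightly differently, which is the same kind of change -ffast-math is allowed to make.)

#include <stddef.h>

double mean_slow(const double *a, size_t n) {
    double m = 0.0;
    for (size_t i = 0; i < n; ++i)
        m += a[i] / n;   /* n divisions */
    return m;
}

double mean_fast(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];       /* additions only */
    return s / n;        /* one division */
}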