
I am trying to see whether, with our compiler, there is a gain between writing

x/3.0

versus

const double one_over_three = 1.0/3.0; x * one_over_three;

But I need a bit of help reading assembly, which I haven't learned.

I compiled

int div3(double x) {
  const double three = 1/3.0;
  return x*three;
}

and obtained

    div3(double):
    pushq   %rbp
    movq    %rsp, %rbp
    movsd   %xmm0, -24(%rbp)
    movabsq $4599676419421066581, %rax
    movq    %rax, -8(%rbp)
    movsd   -24(%rbp), %xmm1
    movsd   .LC0(%rip), %xmm0
    mulsd   %xmm1, %xmm0
    cvttsd2si       %xmm0, %eax
    popq    %rbp
    ret
    .LC0:
    .long   1431655765
    .long   1070945621

Then compiled

int div3(double x) {
  return x/3.0;
}

and obtained

    div3(double):
    pushq   %rbp
    movq    %rsp, %rbp
    movsd   %xmm0, -8(%rbp)
    movsd   -8(%rbp), %xmm0
    movsd   .LC0(%rip), %xmm1
    divsd   %xmm1, %xmm0
    cvttsd2si       %xmm0, %eax
    popq    %rbp
    ret
    .LC0:
    .long   0
    .long   1074266112

Q1: Could someone kindly help me read the assembly?

Q2: What is the meaning of those two last lines in each of the codes?

myfirsttime1
  • I'd say that the first asm code is longer than the second one, so it may take slightly more time to run. – ForceBru Aug 16 '16 at 14:14
  • Try this: https://en.wikibooks.org/wiki/X86_Assembly/GAS_Syntax – Michael Kruglos Aug 16 '16 at 14:14
  • you really have to time them to find out, and of course that result only applies to that machine, at that address where you located the tests. You should also try optimizing when you compile; it may reduce some of that code and some of the accesses in both tests. – old_timer Aug 16 '16 at 14:24

2 Answers


If this is how you are doing optimizations, you are doing it wrong. I am sorry, I don't usually use negative terms when explaining something, but you are on the wrong path.

What you are doing is called premature optimization and micro-optimization. You are trying to optimize something you don't know needs optimizing. First, and this is a deal breaker: use compiler optimizations. And then you would normally profile and try to optimize the hot spots.


Let's see how relevant the optimization you are trying to do is:

First I compile with -O3 (gcc 6.1). Let's look at the output (also on the Godbolt compiler explorer):

mul_inverse3(double):
        mulsd   .LC0(%rip), %xmm0     # xmm0 = x * (the 1/3.0 constant at .LC0)
        cvttsd2si       %xmm0, %eax   # truncate the double to a 32-bit int (the return value)
        ret

div3(double):
        divsd   .LC1(%rip), %xmm0     # xmm0 = x / (the 3.0 constant at .LC1)
        cvttsd2si       %xmm0, %eax   # truncate the double to a 32-bit int (the return value)
        ret

where .LC0 and .LC1 are the constants: the nearest double to 1/3.0, and 3.0:

.LC0:
        .long   1431655765
        .long   1070945621
.LC1:
        .long   0
        .long   1074266112
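For reference, listings like this can be produced locally with something like the following (assuming gcc, and assuming the functions live in a file named div3.c; the file name is just for this example):

    gcc -O3 -S div3.c -o div3.s    # emit assembly instead of an object file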

Ok, so you already see how much the compiler can do on its own. This is what you should have been looking at, not the noisy -O0 output.

Which one is faster? As a rule of thumb, multiplication is faster than division on all CPUs; the exact ratio depends on the specific microarchitecture. For x86, see Agner Fog's instruction tables and microarch pdf, and the other perf links in the x86 tag wiki. On Intel Sandybridge, for example, division has ~3x the latency and ~1/12th the throughput of multiplication. And still, the div might not even be the bottleneck, so div vs. mul might not affect total performance: cache misses, other pipeline stalls, or even things like IO could hide the difference entirely.
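If you want to measure it yourself, here is a minimal latency sketch of mine (not part of the original answer): a dependent chain isolates the operation's latency, and the extra add keeps the values in the normal range so denormals don't distort the timing. Compile with optimizations (e.g. gcc -O2) on a POSIX system:

    #include <stdio.h>
    #include <time.h>

    /* Each iteration depends on the previous result, so the loop time is
     * dominated by the latency of the divide (or multiply) plus one add. */
    static double chain(double x, double c, int divide, double *out) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < 100000000L; i++)
            x = (divide ? x / c : x * c) + 1.0;   /* converges to 1.5, stays normal */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        *out = x;   /* keep the result live so the loop is not optimized away */
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void) {
        double r1, r2;
        double td = chain(1.0, 3.0, 1, &r1);
        double tm = chain(1.0, 1.0 / 3.0, 0, &r2);
        printf("div chain: %.3f s (x=%g)\n", td, r1);
        printf("mul chain: %.3f s (x=%g)\n", tm, r2);
        return 0;
    }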


But they are still different. Can we obtain the same code from the compiler? Yes! Add -ffast-math. With this option, the compiler can rearrange/change floating-point operations even if it slightly changes the result (which is what you are trying to do by hand). Be careful though, as this applies to the whole program.

mul_inverse3(double):
        mulsd   .LC0(%rip), %xmm0
        cvttsd2si       %xmm0, %eax
        ret

div3(double):
        mulsd   .LC0(%rip), %xmm0
        cvttsd2si       %xmm0, %eax
        ret

Can we do more? Yes: add -march=native and search the compiler documentation for more switches.

So that was the first lesson: Let the compiler do its optimizations first!


Here comes the second:

You spend a week trying to optimize a random operation. You finally do it! After a week of hard work and sleepless nights you make that operation 10 times faster (wow! congratulations). Then you run your program and see that it is only 1% faster. The horror! How can this be possible? Well, if your program spends only 1% of its time doing that operation, then... you get the point. There can also be other mechanisms underneath, like OS behavior or the CPU pipeline, that make an operation repeated 100 times cost much less than 100 times as much. What you need to do is profile first! Find the hot loops and optimize those! Optimize the loop/function the program spends 60% of its time in.
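On Linux, that first profiling pass can be as simple as this (a sketch assuming perf is installed; ./myprog is a placeholder name):

    perf record ./myprog    # sample where the program really spends its time
    perf report             # then look at the hottest functions first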

More importantly, look for high level optimizations so your code can do less work, instead of just doing the same work faster.
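As a hypothetical illustration of doing less work: if you must scale a whole array, one division outside the loop replaces n divisions inside it (with the same rounding caveat as -ffast-math):

    #include <stddef.h>

    void scale_down(double *a, size_t n, double scale) {
        const double inv = 1.0 / scale;   /* the expensive division happens once */
        for (size_t i = 0; i < n; i++)
            a[i] *= inv;                  /* n cheap multiplies instead of n divides */
    }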

bolov (edited by Peter Cordes)
  • hah, you beat me by a few seconds... – old_timer Aug 16 '16 at 14:43
  • @dwelch be fast or be dead! :) – bolov Aug 16 '16 at 14:43
  • -ffast-math is dangerous and applies to everything at once. Replacing a specific division (where you know it's OK to do) by multiplication with the reciprocal is completely valid and not micro. – harold Aug 16 '16 at 14:47
  • @harold agree. But unless there are scientific calculations involved, I suspect (just a personal hunch) it's mostly ok. – bolov Aug 16 '16 at 14:56
  • @harold added a warning – bolov Aug 16 '16 at 14:59
  • Thank you. Actually, I was doing this little comparison because I was arguing that replacing division by 3.0 with multiplication by `double 1.0/3.0` might not lead to any optimization. I was even concerned that it could be worse: 1) due to `double 1.0/3.0` not being the same as the multiplicative inverse of 3; 2) perhaps the particular division by 3 happened to be faster than multiplication by 1/3, since 3 = 2+1 is rather simple in base 2 compared to `double 1.0/3.0`. – myfirsttime1 Aug 16 '16 at 15:17
  • Yes, FP multiply is always faster than FP divide on all known x86 implementations. By a big factor. See [Agner Fog's instruction tables and microarch pdf](http://agner.org/optimize) and other perf links in the x86 [tag wiki](http://stackoverflow.com/tags/x86/info) – Peter Cordes Aug 16 '16 at 19:46
  • @myfirsttime1: multiplication is constant time, not data-dependent, on modern hardware, except for denormals. Division is one of very few variable-latency ALU operations. Even the best case for division is slower than multiplication, but the best case is division by a power of 2. See [this answer with some experimental throughput numbers for Nehalem](http://stackoverflow.com/a/38857588/224132) – Peter Cordes Aug 16 '16 at 20:18
  • @PeterCordes I want to thank you for this excellent edit. You improved the answer substantially. – bolov Aug 17 '16 at 05:09

The .longs are simply the constants you supplied in the functions, 1/3rd and 3. 3.0 in double is 0x4008000000000000; when broken into 32-bit chunks you get one of them zero and the other 0x40080000, which in decimal is 1074266112. All the code is doing there is pre-computing the constants and loading them into registers to perform the math requested. Why use .longs and decimal? Well, the compiler has to choose something in order to generate the bit patterns it wanted; it could have been an array of bytes or whatever. And decimal vs hex: most folks, even programmers, think in decimal, so they pick one.

Same goes for the other constant, 1/3rd, which is roughly 0x3FD5555555555555 and becomes 1431655765 and 1070945621. Granted, being floating point, you might see some rounding differences if you compute this yourself.
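You can check these bit patterns yourself with a few lines of C (a sketch of mine, not from the answer; memcpy is the well-defined way to reinterpret the bytes of a double):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static void dump(double d) {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);   /* reinterpret the 8 bytes of the double */
        printf("%-20.17g 0x%016llx  .long %lu, %lu\n",
               d, (unsigned long long)bits,
               (unsigned long)(uint32_t)bits,           /* low  .long */
               (unsigned long)(uint32_t)(bits >> 32));  /* high .long */
    }

    int main(void) {
        dump(3.0);     /* 0x4008000000000000 -> .long 0, 1074266112 */
        dump(1/3.0);   /* 0x3fd5555555555555 -> .long 1431655765, 1070945621 */
        return 0;
    }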

If you optimize, you will reduce the amount of asm you have to figure out:

int div3(double x) {
  return x/3.0;
}

and of course it dramatically changes the performance:

0000000000000000 <div3>:
   0:   f2 0f 5e 05 00 00 00    divsd  0x0(%rip),%xmm0        # 8 <div3+0x8>
   7:   00 
   8:   f2 0f 2c c0             cvttsd2si %xmm0,%eax
   c:   c3                      retq   

Disassembly of section .rodata.cst8:

0000000000000000 <.LC0>:
   0:   00 00                   add    %al,(%rax)
   2:   00 00                   add    %al,(%rax)
   4:   00 00                   add    %al,(%rax)
   6:   08                      .byte 0x8
   7:   40                      rex

This is not linked. Note that our constant is there in a different form: the disassembler has blindly decoded the 8 bytes of the double 3.0 (00 00 00 00 00 00 08 40) as if they were instructions.
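Output in this form looks like objdump's; assuming gcc and binutils, listings like these can be reproduced with something like:

    gcc -O2 -c div3.c    # compile optimized, stop at the (unlinked) object file
    objdump -D div3.o    # -D disassembles every section, including .rodata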

int div3(double x) {
  const double three = 1/3.0;
  return x*three;
}

0000000000000000 <div3>:
   0:   f2 0f 59 05 00 00 00    mulsd  0x0(%rip),%xmm0        # 8 <div3+0x8>
   7:   00 
   8:   f2 0f 2c c0             cvttsd2si %xmm0,%eax
   c:   c3                      retq   

Disassembly of section .rodata.cst8:

0000000000000000 <.LC0>:
   0:   55                      push   %rbp
   1:   55                      push   %rbp
   2:   55                      push   %rbp
   3:   55                      push   %rbp
   4:   55                      push   %rbp
   5:   55                      push   %rbp
   6:   d5                      (bad)  
   7:   3f                      (bad)  

And I assume you understand that, this being floating point, even with double there is no reason to expect the same result from these two functions.
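A quick demonstration of that (my example, not old_timer's): with x = 5.0 the two expressions already differ in the last bit of the double result, though the int truncation in the functions above happens to hide it:

    #include <stdio.h>

    int main(void) {
        double x = 5.0;
        double by_div = x / 3.0;           /* correctly rounded 5/3 */
        double by_mul = x * (1.0 / 3.0);   /* rounds twice: once for 1/3, once for the product */
        printf("x/3.0       = %.17g\n", by_div);   /* prints 1.6666666666666667 */
        printf("x*(1.0/3.0) = %.17g\n", by_mul);   /* prints 1.6666666666666665 */
        printf("equal: %s\n", by_div == by_mul ? "yes" : "no");
        return 0;
    }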

old_timer
  • all the code is doing is performing the multiply or divide against a pre-computed constant and then converting to int. – old_timer Aug 16 '16 at 14:44