
Let's consider the following function:

#include <stdint.h>
uint64_t foo(uint64_t x) { return x * 3; }

If I were to write it, I'd do

.global foo
.text
foo:
    imul $0x3, %rdi, %rax
    ret

On the other hand, with -O0 the compiler generates two additions:

   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
   8:   48 8b 55 f8             mov    -0x8(%rbp),%rdx
   c:   48 89 d0                mov    %rdx,%rax
   f:   48 01 c0                add    %rax,%rax
  12:   48 01 d0                add    %rdx,%rax
  15:   5d                      pop    %rbp
  16:   c3                      retq   

or lea with -O2:

0000000000000000 <foo>:
   0:   48 8d 04 7f             lea    (%rdi,%rdi,2),%rax
   4:   c3                      retq   

Why? Since (I assumed) every assembly instruction takes one processor clock tick, my version should run in 2 CPU clock cycles (it has two instructions), while the -O0 version needs 4 cycles to perform the addition, because it could be rewritten as

  mov    %rdi,%rax
  add    %rax,%rax
  add    %rdi,%rax
  retq

and the lea version should also take two cycles.

marmistrz

  • "Since every assembly instruction equals one processor clock tick" this is not true. – Margaret Bloom Mar 07 '17 at 16:08
  • Your title does not reflect the body of your question. – Jester Mar 07 '17 at 16:09
  • The title being essentially "*What's the `q` for?*" – David C. Rankin Mar 07 '17 at 16:11
  • Yes, I gave it the wrong title. Fixed that. – marmistrz Mar 07 '17 at 16:14
  • Often compilers know better than a human what is the fastest way. Sometimes not, but often yes. I remember that on X86, the compiler always produced the sequence "push bp; mov bp, sp; sub sp,xxx" instead of the single one "enter xxx". It turned out that the compiler was right, those three instructions were faster than a single "enter xxx". – linuxfan says Reinstate Monica Mar 07 '17 at 16:36
  • Relevant: [Why does my assembly use “+” and “*” signs instead of instructions for calculating addresses?](/q/23659656) contains some of the answer. The mention that the address-arithmetic unit is faster than ALU addition misses another point, namely that arithmetic resources may be used by other threads or by speculative/out-of-order execution if simple code like this leaves them available. That's really hard to get right without the compiler thinking for you. – Toby Speight Mar 07 '17 at 17:35
  • Other somewhat-duplicates: [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) / [Assembly why lea is fast?](https://stackoverflow.com/q/67034266) – Peter Cordes Jun 25 '21 at 04:58

1 Answer


You're targeting a processor with dedicated address-calculation units. It's likely to be faster to compute small multiplications in the address calculator than in a general-purpose arithmetic/logic unit (ALU).

Also, depending on your processor model, the ALU may be shared with other code, either due to hyperthreading or by speculative or out-of-order execution within the same thread. Your compiler is making a good estimate of how best to utilise the available resources to give a good throughput of execution without stalling.

The idea that "every assembly instruction equals one processor clock tick" (or even a fixed number of cycles) has only ever been true on the very simplest of processors.

Toby Speight
  • Is there any resource about how many instruction cycles each instruction takes? – marmistrz Mar 07 '17 at 19:03
  • Where does the multiplication happen then? – marmistrz Mar 07 '17 at 19:36
  • With old-school microprocessors such as 8080, it's possible to count cycles, but with superscalar out-of-order processors such as x86-64, it's a lot more difficult, because instructions overlap so much. Not only that, but instruction and data access times can vary greatly depending on whether the accessed memory is in cache (and which level of cache). – Toby Speight Mar 08 '17 at 09:43
  • I don't know for sure, but I expect there's an address calculator unit that's much smaller than the ALU, and can do addition and some shifts (i.e. multiply by 1,2,4,8,16), probably with a restricted set of source operands (perhaps a subset of registers, some immediate or fixed values and not much else). It's possible for some implementations of the instruction set to turn to main ALU for these operations while more expensive chips have dedicated addressing unit, so you really need to consult your documentation. – Toby Speight Mar 08 '17 at 09:47
  • @marmistrz: [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) and [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) have some good info. You can't simply add up a "cycle cost" for each instruction, the whole point of superscalar out-of-order exec is to do better than that. – Peter Cordes Jun 25 '21 at 04:52
  • There isn't a single "the ALU", for example Haswell has integer ALUs on four separate execution ports, two of which can run LEA (at 1 cycle latency), 1 of which can run imul (at 3 cycle latency). https://www.realworldtech.com/haswell-cpu/4/ / https://uops.info/, especially https://agner.org/optimize/. Out-of-order exec CPUs generally don't run LEA on an AGU, instead just replicating the shift-and-add logic in an ALU instead of having scheduling competition between loads/stores and instructions that don't access memory. – Peter Cordes Jun 25 '21 at 04:55