
This question is about how we multiply an integer by a constant. So let's look at a simple function:

int f(int x) {
    return 10*x;
}

How can that function be optimized best, especially when inlined into a caller?

Approach 1 (produced by most optimizing compilers, e.g. on Godbolt)

    lea    (%rdi,%rdi,4), %eax
    add    %eax, %eax

Approach 2 (produced with clang3.6 and earlier, with -O3)

    imul   $10, %edi, %eax

Approach 3 (produced with g++6.2 without optimization, removing stores/reloads)

    mov    %edi, %eax
    sal    $2, %eax
    add    %edi, %eax
    add    %eax, %eax

Which version is fastest, and why? I'm primarily interested in Intel Haswell.

Peter Cordes
overseas
    Try to build with optimization enabled. Add the `-O2` flag and check the assembly then. – Some programmer dude Nov 12 '16 at 14:48
    The performance of unoptimized C++ builds (and thus the performance of the produced assembly) is completely meaningless. Compile with -O2 or -O3 to get meaningful output. – Baum mit Augen Nov 12 '16 at 14:50
For the downvoters: I added a comment. – overseas Nov 12 '16 at 14:55
  • I know math and algorithms, not chips. It makes sense that shift and add would be faster than any sort of multiplication, but have you tried timing them? – Kenny Ostrom Nov 12 '16 at 15:04
  • @KennyOstrom See my answer below – overseas Nov 12 '16 at 16:41
    clang output is still from `-O0`. And instead of `volatile`, just write a function that takes an arg and returns a value, so there's nothing to optimize away. `int foo(int a){return a*10;}`: https://godbolt.org/g/uThcu7. Then gcc, clang, and icc (with `-O3 -march=haswell`) all pick the 2 uops, 2c latency ADD+LEA version instead of the 1 uop 3c latency `imul r,r,imm` version. (clang3.6 and earlier does use just a single IMUL, but later clang changed to the lower-latency version that takes more uops, at least in this case. I doubt clang models the CPU frontend to decide on lat vs. tput). – Peter Cordes Nov 12 '16 at 23:50
    *As it turns out, this code is even much less efficient than using imul,* You mean when compiled with `-O0` so variables are stored / reloaded to memory after every statement? Yeah go figure. There is a real question here, but it's buried under a ton of bogus nonsense from using un-optimized compiler output. – Peter Cordes Nov 12 '16 at 23:52
Yes, true, but you misunderstand my second question. It is not about performance. We have three versions: imul, the badly performing complicated shift version, and the fast but not that easy-to-read lea version. I understand when a compiler uses the first version, and I understand when a compiler would use the third. But why should it choose the version which is slow **and** hard to read? – overseas Nov 13 '16 at 09:05
And yes, the clang output is with -O0, because it is in the section where I talk about -O0. Whether I use volatile or another function should not matter, because in my actual test I calculate thousands of times, and finally (and only then) I assign the result to a volatile variable. So it doesn't matter at all, as long as I get the effect I want (and I always looked at the assembly). Will look at the other comments below later. – overseas Nov 13 '16 at 09:17
  • Your update was an improvement, but still had some wasted MOV instructions, most importantly in the LEA version, but gcc6.2 doesn't have any more MOV instructions than needed in the -O0 version. If we're talking about a candidate for fastest, obviously we don't want extra wasted MOVs. Fixed that for you, and linked to Godbolt where the asm came from. These are all now valid functions with the arg in EDI and the result in EAX, just leaving out the RET instruction because we're talking about inlining without function-call overhead. – Peter Cordes Nov 13 '16 at 18:43
  • Thanks. I added the versions below with new data. Will add a link to godbolt in one minute. – overseas Nov 13 '16 at 20:13

3 Answers


According to Agner Fog's testing (and other results such as AIDA64), Intel CPUs since Core 2 have had `imul r32,r32,imm` latency of 3c, throughput of one per 1c. Since Nehalem, 64-bit multiplies are also that fast. (Agner lists Nehalem's `imul r64,r64,imm` as slower (one per 2c throughput) than `imul r64,r64`, but that doesn't match other results; InstLatx64 says one per 1c.)

AMD CPUs before Ryzen are slower, e.g. Steamroller has lat=4c, tput=one per 2c for 32-bit multiply; for 64-bit multiply, lat=6c, tput=one per 4c. AMD Ryzen has the same excellent multiply performance as Intel.
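
As an aside, latency vs. throughput is exactly what a dependent chain vs. independent multiplies expose. A minimal GNU-inline-asm sketch (function names, the constant, and loop structure are illustrative, not Agner's actual harness):

    #include <cstdint>

    // Dependent chain: each imul waits for the previous result,
    // so this runs at ~3 cycles per imul on Haswell (latency bound).
    int imul_latency_chain(int x, int64_t iters) {
        for (int64_t i = 0; i < iters; ++i)
            asm volatile("imul $10, %0, %0" : "+r"(x));
        return x;
    }

    // Independent multiplies: the out-of-order core overlaps them,
    // approaching one imul per cycle (port 1 throughput bound).
    int imul_throughput(int x, int64_t iters) {
        int a = 0, b = 0, c = 0;
        for (int64_t i = 0; i < iters; ++i) {
            asm volatile("imul $10, %1, %0" : "=r"(a) : "r"(x));
            asm volatile("imul $10, %1, %0" : "=r"(b) : "r"(x));
            asm volatile("imul $10, %1, %0" : "=r"(c) : "r"(x));
        }
        return a + b + c;
    }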


LEA with 2 components in the addressing mode (base + index, but no constant displacement) runs with 1c latency on all Intel CPUs^1, except maybe Atom, where LEA runs in a different stage of the pipeline (in the actual AGU, not an ALU) and needs its input ready 4c earlier than a "normal" ALU instruction. Conversely, its output is ready sooner, so a following ADD can use the result in the same cycle, I think. (I haven't tested this, and don't have any Atom HW.)

On Intel SnB-family, simple-LEA can run on ports 1 or 5, so it has twice the throughput of IMUL.

ADD can run on any ALU port on any CPU. HSW introduced a 4th ALU port (vs. IvyBridge), so it can sustain 4 ALU uops per clock (in theory).

So the LEA+ADD version has 2c latency on most x86 CPUs, and on Haswell can run two multiplies per clock.
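
To make the decomposition explicit (the annotations and the extra *9 example are mine):

    # x*10 = (x + x*4) * 2
    lea    (%rdi,%rdi,4), %eax    # eax = x*5   (1c latency, port 1 or 5)
    add    %eax, %eax             # eax = x*10  (1c latency, any ALU port)

    # x*9 = x + x*8 fits in a single LEA:
    lea    (%rdi,%rdi,8), %eax    # eax = x*9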

Footnote 1: On AMD (including Zen / Zen2), a scaled-index makes an LEA "slow" (2 cycle latency and runs on fewer ports). e.g. lea r32, [r64+r64*2] measured at 2 cycle latency on Zen2 vs. 1 cycle on Skylake. (Agner Fog also mentions that lea r32, [r64...] is slower on AMD, but that might only have been a Bulldozer effect; it's not apparent in https://uops.info/'s results for Zen / Zen2.)


But if the multiply is only one small part of a bigger surrounding loop that bottlenecks on total uop throughput, not multiply latency or throughput, the IMUL version is better.


If your multiply constant is too big for two LEAs, or a SHL+LEA, then you're probably better off with IMUL, especially when tuning primarily for Intel CPUs with their extremely high performance integer multipliers.

SHL+LEA or SHL+SUB might be useful, e.g. to multiply by 63 (from Godbolt: gcc6.2 -O3 -march=haswell):

    movl    %edi, %eax
    sall    $6, %eax
    subl    %edi, %eax
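
The C source behind that output would be something like (a sketch; `mul63` is an illustrative name):

    int mul63(int x) {
        return x * 63;   // 63*x = (x << 6) - x
    }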

On Haswell, where MOV is zero-latency, this has only 2c latency. But it's 3 fused-domain uops vs. 1 for `imull $63, %edi, %eax`, so it's more uops in the pipeline, reducing how far ahead the CPU can "see" to do out-of-order execution. It also increases pressure on the uop cache and L1 I-cache if a compiler consistently picks this strategy, because it's more instruction bytes.

On CPUs before IvyBridge, this is strictly worse than IMUL unless something else is competing for port1, because it's 3c latency (the MOV is on the critical path dependency chain, and has 1c latency).

As usual, none of the asm fragments can be said to be optimal for all situations. It depends on what the bottleneck is in the surrounding code: latency, throughput, or uops.

The answer will be different for the same surrounding code on different microarchitectures, too.

Peter Cordes

I'd guess that the shift-and-add sequence was faster than imul; this has been true for many generations of x86 chips. I don't know if it is true of Haswell; still, doing an imul in 2 clock cycles takes significant chip resources, if it is doable at all.

I'm a bit surprised it didn't produce an even faster sequence:

 lea   y, [2*y]
 lea   y, [5*y]

[OP edited his answer, showing optimized code producing ADD plus LEA. Yes, that's a better answer: `add r,r` is smaller than `lea y,[2*y]`, so the resulting code is smaller and the same speed.]
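
For reference, a concrete AT&T-syntax spelling of the two-LEA idea (register choice is illustrative; x86 scale factors are limited to 1/2/4/8, so *5 is written as base + index*4):

    lea    (%rdi,%rdi), %eax      # eax = 2*x
    lea    (%rax,%rax,4), %eax    # eax = 5*(2*x) = 10*x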

Ira Baxter
Thanks, is there any resource providing numbers for this? – overseas Nov 12 '16 at 14:56
www.agnesfog.org is a widely respected resource for understanding how to optimize x86 code. Actual cycle counts for modern CPUs are hard to obtain in reference documents; Fog seems to measure everything and is generally pretty up to date. – Ira Baxter Nov 12 '16 at 14:59
This website does not exist. Do you perhaps mean the documents on agner.org? – overseas Nov 12 '16 at 15:01
  • http://stackoverflow.com/questions/6120207/imul-or-shift-instruction mentions https://en.wikipedia.org/wiki/Strength_reduction – Kenny Ostrom Nov 12 '16 at 15:10
It seems to be faster in general, except when you can make use of ILP (see measurements below). Would be glad if you could check whether this makes sense, or whether my assembly is not well optimized. – overseas Nov 12 '16 at 16:41
  • ILP should pick up the add/shift/lea sequences pretty nicely especially if they are on separate registers. – Ira Baxter Nov 12 '16 at 17:27
The registers are assigned automatically by g++. With ILP and without loop unrolling I get the following code for the imul version (V1): http://pastebin.com/r3edQPXD, and the following code for the add/shift/lea version (V3): http://pastebin.com/vVB7JeKq. The results for imul at least are expected, as the Intel optimization manual mentions 3-4 cycles latency and 1 cycle throughput. – overseas Nov 12 '16 at 19:41
    @Ira: Haswell imul is always 3c latency for the multi-operand version, even for 64-bit operand-size. (SnB-family has no 2c latency uops to make scheduling easier/better, even if they wanted to spend that many transistors on an extremely low latency multiplier. This is why 3-component `lea eax, [rdi + rdi*2 + 1]` has 3c latency, not 2, since they didn't want to implement both adds in 1c like core2 did). BTW, out-of-order execution means that separate registers is irrelevant: WAW and WAR hazards are irrelevant thanks to register renaming. Only true dependencies (Read-After-Write) matter. – Peter Cordes Nov 13 '16 at 00:47

I just did some measurements. We mimic the following code in assembly, using the instructions given in the question:

    for (unsigned i = 0; i < (1 << 30); ++i) {
        r1 = r2 * 10;
        r2 = r1 * 10;
    }

Possibly there are some additional registers needed for temporaries.

Note: All measurements are in cycles per one multiplication.
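
For illustration, one V1 step can be pinned down with inline asm so the compiler can't fold the multiply away (a sketch with an assumed constraint layout, mirroring the input != output register pattern from the question; the actual benchmark generates such steps via templates):

    // r_out = r_in * 10 as a single IMUL; distinct in/out registers.
    static inline int mul10_v1(int in) {
        int out;
        asm("imul $10, %1, %0" : "=r"(out) : "r"(in));
        return out;
    }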

We use the clang compiler with -O3, because there we get exactly the assembly we want (gcc sometimes adds a few MOVs inside the loop). We have two parameters: u = #unrolled loops, and i = #ILP. For example, for u=4, i=2, we get the following:

  401d5b:   0f a2                   cpuid  
  401d5d:   0f 31                   rdtsc  
  401d5f:   89 d6                   mov    %edx,%esi
  401d61:   89 c7                   mov    %eax,%edi
  401d63:   41 89 f0                mov    %esi,%r8d
  401d66:   89 f8                   mov    %edi,%eax
  401d68:   b9 00 00 20 00          mov    $0x200000,%ecx
  401d6d:   0f 1f 00                nopl   (%rax)
  401d70:   6b d5 0a                imul   $0xa,%ebp,%edx
  401d73:   41 6b f7 0a             imul   $0xa,%r15d,%esi
  401d77:   6b fa 0a                imul   $0xa,%edx,%edi
  401d7a:   6b d6 0a                imul   $0xa,%esi,%edx
  401d7d:   6b f7 0a                imul   $0xa,%edi,%esi
  401d80:   6b fa 0a                imul   $0xa,%edx,%edi
  401d83:   6b d6 0a                imul   $0xa,%esi,%edx
  401d86:   6b f7 0a                imul   $0xa,%edi,%esi
  401d89:   6b fa 0a                imul   $0xa,%edx,%edi
  401d8c:   6b d6 0a                imul   $0xa,%esi,%edx
  401d8f:   6b f7 0a                imul   $0xa,%edi,%esi
  401d92:   6b fa 0a                imul   $0xa,%edx,%edi
  401d95:   44 6b e6 0a             imul   $0xa,%esi,%r12d
  401d99:   44 6b ef 0a             imul   $0xa,%edi,%r13d
  401d9d:   41 6b ec 0a             imul   $0xa,%r12d,%ebp
  401da1:   45 6b fd 0a             imul   $0xa,%r13d,%r15d
  401da5:   ff c9                   dec    %ecx
  401da7:   75 c7                   jne    401d70 <_Z7measureIN5timer5rtdscE2V1Li16777216ELi4ELi2EEvv+0x130>
  401da9:   49 c1 e0 20             shl    $0x20,%r8
  401dad:   49 09 c0                or     %rax,%r8
  401db0:   0f 01 f9                rdtscp 

Note that these are not 8, but 16 imul instructions, because the loop body computes `r1 = r2 * 10; r2 = r1 * 10;` (u=4 unrolls × i=2 chains × 2 multiplies). For the remaining versions I will post only the main loop.


Instead of RDTSC we use perf (PERF_COUNT_HW_REF_CPU_CYCLES = "ref cycles" and PERF_COUNT_HW_CPU_CYCLES = "cpu cycles").
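
For reference, a minimal sketch of reading such a counter around the measured loop via perf_event_open (Linux-only, error handling omitted; this is an assumption about the harness, not the exact code used):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdint>
    #include <cstdio>

    static int open_counter(uint64_t config) {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = config;          // e.g. PERF_COUNT_HW_CPU_CYCLES
        attr.disabled = 1;
        attr.exclude_kernel = 1;       // count user-space only
        return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main() {
        int fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        // ... run the measured multiply loop here ...
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t cycles = 0;
        read(fd, &cycles, sizeof(cycles));
        std::printf("cpu cycles: %llu\n", (unsigned long long)cycles);
    }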

We fix u = 16, and vary i = {1, 2, 4, 8, 16, 32}. We get

name    uroll ilp      ref cyc cpu cyc p0      p1      p2      p3      p4      p5      p6      p7
V1      16    1        2.723   3.006   0.000   1.000   0.000   0.000   0.000   0.000   0.031   0.000
V1      16    2        1.349   1.502   0.000   1.000   0.000   0.000   0.000   0.000   0.016   0.000
V1      16    4        0.902   1.002   0.000   1.000   0.000   0.000   0.000   0.000   0.008   0.000
V1      16    8        0.899   1.001   0.000   1.000   0.004   0.006   0.008   0.002   0.006   0.002
V1      16    16       0.898   1.001   0.000   1.000   0.193   0.218   0.279   0.001   0.003   0.134
V1      16    32       0.926   1.008   0.000   1.004   0.453   0.490   0.642   0.001   0.002   0.322

This makes sense; the ref cycles can be ignored (they count at the rated frequency, not the actual core clock).

The columns on the right show the number of uops per multiplication on each execution port. We have one uop on p1 (the imul, of course), and on p6 we have the macro-fused dec/jne (1/32 per multiply in the first row, matching the 0.031). Later we also see extra uops on other ports, due to high register pressure.
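
As a back-of-the-envelope check against the known imul numbers (3c latency, one per 1c throughput), this is my arithmetic, not a measurement:

    cycles per imul = max(3 / i, 1)         (i = independent chains)
    i=1 -> 3.0,  i=2 -> 1.5,  i>=4 -> 1.0   (matches the cpu cycles column)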

Ok, let's move to the second version, which is now:

  402270:   89 ea                   mov    %ebp,%edx
  402272:   c1 e2 02                shl    $0x2,%edx
  402275:   01 ea                   add    %ebp,%edx
  402277:   01 d2                   add    %edx,%edx
  402279:   44 89 fe                mov    %r15d,%esi
  40227c:   c1 e6 02                shl    $0x2,%esi
  40227f:   44 01 fe                add    %r15d,%esi
  402282:   01 f6                   add    %esi,%esi
  402284:   89 d7                   mov    %edx,%edi
  402286:   c1 e7 02                shl    $0x2,%edi
  402289:   01 d7                   add    %edx,%edi
  40228b:   01 ff                   add    %edi,%edi
  40228d:   89 f2                   mov    %esi,%edx
  40228f:   c1 e2 02                shl    $0x2,%edx
  402292:   01 f2                   add    %esi,%edx
  402294:   01 d2                   add    %edx,%edx
  402296:   89 fe                   mov    %edi,%esi
  402298:   c1 e6 02                shl    $0x2,%esi
  40229b:   01 fe                   add    %edi,%esi
  40229d:   01 f6                   add    %esi,%esi
  40229f:   89 d7                   mov    %edx,%edi
  4022a1:   c1 e7 02                shl    $0x2,%edi
  4022a4:   01 d7                   add    %edx,%edi
  4022a6:   01 ff                   add    %edi,%edi
  4022a8:   89 f2                   mov    %esi,%edx
  4022aa:   c1 e2 02                shl    $0x2,%edx
  4022ad:   01 f2                   add    %esi,%edx
  4022af:   01 d2                   add    %edx,%edx
  4022b1:   89 fe                   mov    %edi,%esi
  4022b3:   c1 e6 02                shl    $0x2,%esi
  4022b6:   01 fe                   add    %edi,%esi
  4022b8:   01 f6                   add    %esi,%esi
  4022ba:   89 d7                   mov    %edx,%edi
  4022bc:   c1 e7 02                shl    $0x2,%edi
  4022bf:   01 d7                   add    %edx,%edi
  4022c1:   01 ff                   add    %edi,%edi
  4022c3:   89 f2                   mov    %esi,%edx
  4022c5:   c1 e2 02                shl    $0x2,%edx
  4022c8:   01 f2                   add    %esi,%edx
  4022ca:   01 d2                   add    %edx,%edx
  4022cc:   89 fe                   mov    %edi,%esi
  4022ce:   c1 e6 02                shl    $0x2,%esi
  4022d1:   01 fe                   add    %edi,%esi
  4022d3:   01 f6                   add    %esi,%esi
  4022d5:   89 d7                   mov    %edx,%edi
  4022d7:   c1 e7 02                shl    $0x2,%edi
  4022da:   01 d7                   add    %edx,%edi
  4022dc:   01 ff                   add    %edi,%edi
  4022de:   89 f2                   mov    %esi,%edx
  4022e0:   c1 e2 02                shl    $0x2,%edx
  4022e3:   01 f2                   add    %esi,%edx
  4022e5:   01 d2                   add    %edx,%edx
  4022e7:   89 fe                   mov    %edi,%esi
  4022e9:   c1 e6 02                shl    $0x2,%esi
  4022ec:   01 fe                   add    %edi,%esi
  4022ee:   01 f6                   add    %esi,%esi
  4022f0:   89 d5                   mov    %edx,%ebp
  4022f2:   c1 e5 02                shl    $0x2,%ebp
  4022f5:   01 d5                   add    %edx,%ebp
  4022f7:   01 ed                   add    %ebp,%ebp
  4022f9:   41 89 f7                mov    %esi,%r15d
  4022fc:   41 c1 e7 02             shl    $0x2,%r15d
  402300:   41 01 f7                add    %esi,%r15d
  402303:   45 01 ff                add    %r15d,%r15d
  402306:   ff c8                   dec    %eax
  402308:   0f 85 62 ff ff ff       jne    402270 <_Z7measureIN5timer5rtdscE2V2Li16777216ELi4ELi2EEvv+0xe0>

Measurements for V2

name    uroll ilp      ref cyc cpu cyc p0      p1      p2      p3      p4      p5      p6      p7
V2      16    1        2.696   3.004   0.776   0.714   0.000   0.000   0.000   0.731   0.811   0.000
V2      16    2        1.454   1.620   0.791   0.706   0.000   0.000   0.000   0.719   0.800   0.000
V2      16    4        0.918   1.022   0.836   0.679   0.000   0.000   0.000   0.696   0.795   0.000
V2      16    8        0.914   1.018   0.864   0.647   0.006   0.002   0.012   0.671   0.826   0.008
V2      16    16       1.277   1.414   0.834   0.652   0.237   0.263   0.334   0.685   0.887   0.161
V2      16    32       1.572   1.751   0.962   0.703   0.532   0.560   0.671   0.740   1.003   0.230

This also makes sense: we are slower, and with i=32 we clearly have too much register pressure (shown by the other ports used and by the assembly). With i=1 we can verify that p0+p1+p5+p6 ≈ 3.001 uops per multiply, so there is no ILP, of course. One might expect 4 cpu cycles, but the MOV is free (eliminated at register rename), giving 3.

With i=4: SHL executes on p0 or p6, and the ADDs on p0, p1, p5, or p6. The MOV needs no execution port at all (it is eliminated at rename), but it still counts as a fused-domain uop, so the 4 uops per multiply bottleneck on the 4-wide frontend and we get 1 cycle per multiplication.
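
Spelled out per multiplication (assuming perfect scheduling; again my arithmetic, not a measurement):

    MOV + SHL + ADD + ADD = 4 fused-domain uops
    MOV is eliminated              -> 3 uops need an ALU port
    SHL: p0/p6; ADDs: p0/p1/p5/p6  -> ports could sustain ~0.75c
    frontend: 4 uops / 4 per clock -> 1c per multiply (the bottleneck)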

And finally with the optimized version:

  402730:   67 8d 7c ad 00          lea    0x0(%ebp,%ebp,4),%edi
  402735:   01 ff                   add    %edi,%edi
  402737:   67 43 8d 2c bf          lea    (%r15d,%r15d,4),%ebp
  40273c:   01 ed                   add    %ebp,%ebp
  40273e:   67 8d 1c bf             lea    (%edi,%edi,4),%ebx
  402742:   01 db                   add    %ebx,%ebx
  402744:   67 8d 7c ad 00          lea    0x0(%ebp,%ebp,4),%edi
  402749:   01 ff                   add    %edi,%edi
  40274b:   67 8d 2c 9b             lea    (%ebx,%ebx,4),%ebp
  40274f:   01 ed                   add    %ebp,%ebp
  402751:   67 8d 1c bf             lea    (%edi,%edi,4),%ebx
  402755:   01 db                   add    %ebx,%ebx
  402757:   67 8d 7c ad 00          lea    0x0(%ebp,%ebp,4),%edi
  40275c:   01 ff                   add    %edi,%edi
  40275e:   67 8d 2c 9b             lea    (%ebx,%ebx,4),%ebp
  402762:   01 ed                   add    %ebp,%ebp
  402764:   67 8d 1c bf             lea    (%edi,%edi,4),%ebx
  402768:   01 db                   add    %ebx,%ebx
  40276a:   67 8d 7c ad 00          lea    0x0(%ebp,%ebp,4),%edi
  40276f:   01 ff                   add    %edi,%edi
  402771:   67 8d 2c 9b             lea    (%ebx,%ebx,4),%ebp
  402775:   01 ed                   add    %ebp,%ebp
  402777:   67 8d 1c bf             lea    (%edi,%edi,4),%ebx
  40277b:   01 db                   add    %ebx,%ebx
  40277d:   67 44 8d 64 ad 00       lea    0x0(%ebp,%ebp,4),%r12d
  402783:   45 01 e4                add    %r12d,%r12d
  402786:   67 44 8d 2c 9b          lea    (%ebx,%ebx,4),%r13d
  40278b:   45 01 ed                add    %r13d,%r13d
  40278e:   67 43 8d 2c a4          lea    (%r12d,%r12d,4),%ebp
  402793:   01 ed                   add    %ebp,%ebp
  402795:   67 47 8d 7c ad 00       lea    0x0(%r13d,%r13d,4),%r15d
  40279b:   45 01 ff                add    %r15d,%r15d
  40279e:   ff c9                   dec    %ecx
  4027a0:   75 8e                   jne    402730 <_Z7measureIN5timer5rtdscE2V3Li16777216ELi4ELi2EEvv+0x170>

Measurements for V3

name    uroll ilp      ref cyc cpu cyc p0      p1      p2      p3      p4      p5      p6      p7
V3      16    1        1.797   2.002   0.447   0.558   0.000   0.000   0.000   0.557   0.469   0.000
V3      16    2        1.273   1.418   0.448   0.587   0.000   0.000   0.000   0.528   0.453   0.000
V3      16    4        0.774   0.835   0.449   0.569   0.000   0.000   0.000   0.548   0.442   0.000
V3      16    8        0.572   0.636   0.440   0.555   0.017   0.021   0.032   0.562   0.456   0.012
V3      16    16       0.753   0.838   0.433   0.630   0.305   0.324   0.399   0.644   0.458   0.165
V3      16    32       0.976   1.087   0.467   0.766   0.514   0.536   0.701   0.737   0.458   0.333

Okay, now we are faster than the imul version: 2 cycles for i=1 (1c latency for each of the two instructions), and with i=8 we are even faster than one multiply per cycle. The LEA can execute on p1 or p5, and the ADD, as above, on p0, p1, p5, or p6. So with perfect scheduling the LEAs would go to p1 and p5, and the ADDs to p0 and p6. Unfortunately, in this case the scheduling isn't quite that perfect (the assembly is fine); I suppose a few ADDs land on the p1/p5 ports.
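
The same arithmetic for the LEA+ADD version (again assuming perfect scheduling):

    LEA + ADD = 2 fused-domain uops per multiply
    frontend: 2 uops / 4 per clock -> 0.5c per multiply
    LEA: p1/p5; ADD: p0/p1/p5/p6   -> 0.5c if the ADDs avoid p1/p5
    measured: 0.636c, so some ADDs steal p1/p5 from the LEAs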

All code can be seen on gcc.godbolt.org (it has quite a bit of template magic, but boils down to extremely simple code, which has been checked many times). Note that I removed the timers and some other stuff which is not necessary for checking the assembly.

overseas
`2.78` cycles per IMUL is because you're measuring reference cycles, not core clock cycles, and you have Turbo enabled. On Haswell, `imul r,r, imm` is 1 uop (for port 1), with 3c latency, one per 1c throughput. (Source: http://agner.org/optimize/). 2-component LEA is 1 uop for port 1 or 5, with 1c latency, and one per 0.5c throughput. Measure with perf counters (e.g. `perf stat ./a.out`) for better results. – Peter Cordes Nov 12 '16 at 23:34
  • The tradeoff between IMUL and LEA is interesting for multiplying by 10 or something, where it takes two instructions to do it without IMUL, but still has 2c latency instead of 3c. If fused-domain uop throughput is a bottleneck but latency isn't, then IMUL wins. IIRC, clang likes to use IMUL for `x*10` but gcc likes to use LEA+ADD, both with `-O3 -march=haswell`. – Peter Cordes Nov 12 '16 at 23:37
  • Is that IMUL inline-asm loop compiled without optimization? That would explain the extra MOV instructions. Luckily for you, Haswell handles those MOV instructions with zero latency, in the register-rename stage. – Peter Cordes Nov 12 '16 at 23:43
@PeterCordes Thanks for your comments. I integrated perf now and updated the measurements (it would be great if you could comment on the difference I still observe between ref cycles / cpu cycles). I'm somewhat confused why this question gets downvoted that often, because I am pretty sure my measurements are getting better, and the question isn't that bad for sure. So if anyone misses something, I'd prefer a comment rather than just downvotes (in principle I don't care about downvotes, but I'd like to do these measurements right). – overseas Nov 13 '16 at 16:18
Reference cycles (counted by RDTSC and the ref cycles performance counter) always count at the CPU's rated max frequency. (e.g. a "2.4GHz" CPU will count at that rate, regardless of whether the core running your code is turboed up to 3.1GHz or down at 1.6GHz. RDTSC is currently designed to work well as a time source for stuff like gettimeofday, more so than for perf measurement.) Since everything you're measuring scales perfectly with core clock cycles, you should only care about those; then you don't need warm-up loops, and more importantly your numbers will match the CPU's real behaviour in whole-number cycles – Peter Cordes Nov 13 '16 at 17:44
  • e.g. your first set of results for core clock cycles exactly matches what you'd predict from the known fact that `imul r,r,imm` has 3c latency, one per 1c throughput: bottleneck on latency at one per 3c with no ILP, or at two per 3c. But with more ILP than that, bottleneck on one per clock throughput. – Peter Cordes Nov 13 '16 at 17:48
  • re: question downvotes: like I commented yesterday, the reason I downvoted is that you distract from the question of IMUL vs. LEA+ADD by talking about a whole bunch of un-optimized compiler output, which is totally meaningless and a waste of time. e.g. I just tried the `x + x<<2` version on Godbolt, and gcc/clang/icc all saw what it's doing and compiled it the same as `x*10` (https://godbolt.org/g/qCkXNe). If you didn't get that, you're doing something very wrong (like failing to use `-O2` or `-O3`). Once you rewrite the question to take out all the un-optimized crap, I'd love to upvote it – Peter Cordes Nov 13 '16 at 17:56
  • Oh, just found the paragraph you meant about ref cycles: *use acpi-cpufreq driver with governor performance fixed to 3.5 GHz*. That probably doesn't disable turbo. Turbo is always under hardware control, and can activate whenever software sets the clock to the max rated frequency. (Turbo clock-speed decisions take into account more info than SW has. Skylake can even do that for all clock speed decisions, not just turbo. The [the SKL power mgmt talk from IDF2015](http://myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5#sessionID=155) compares HSW vs. SKL) – Peter Cordes Nov 13 '16 at 18:03
In the shift+add section: *i=32, we for sure have too large register pressure*?? Is that compiler output, or hand-written asm? Did it actually spill to memory, or is that just a guess without looking at the asm? The other effect will be that it's too many uops to run from the loopback buffer. It's all single-uop instructions though, so it should run efficiently from the uop cache. Have a look at [this Q&A for graphs of uop throughput vs. loop size](http://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of) – Peter Cordes Nov 13 '16 at 18:13
  • In your LEA version, why are you forcing the base and index registers to be different? And why have those MOV instructions? If you write it like `asm("lea (%[in],%[in],4), %[dst]\n\t"` `"add %[dst], %[dst]"` `: [dst]"=&r"(out) ` `:[in]"r"(input)` `);`, you should get optimal asm that reads from the old register directly, instead of copying R9 into registers for you to destroy. You probably bottleneck on the frontend, 4 fused-domain uops per clock. You should be getting nearly 2 multiplies per clock (with no loop overhead). – Peter Cordes Nov 13 '16 at 18:18
  • It would be a good idea to at least link to the source for this stuff on Godbolt, so we can see what asm is involved for each case of unroll and ILP. – Peter Cordes Nov 13 '16 at 18:22
Ok, I modified the question without the "crap". This is all compiler output; when I use i=32, I see a lot of additional MOVs in the assembler output. – overseas Nov 13 '16 at 18:26
The LEA version: great that you have seen this. We are now at two cycles. Seems like two operations only take 1/2 cycle each, I assume. But not sure. I will check whether I can link the source on Godbolt... – overseas Nov 13 '16 at 18:31
  • So you are still using inline asm, and nowhere are you timing compiler output from the C loop with `r1 = r2 * 10` you show at the top of your answer (clang does some optimizations on it, e.g. it notices that the result is always zero, but still loops with -march=haswell) See https://godbolt.org/g/6PlRhQ. You should just take out that C++ source, and just keep the actual asm loops you're measuring. Don't imply you got that loop to compile to the asm you show. – Peter Cordes Nov 13 '16 at 23:34
  • In your inline-asm, you don't need `\r\n` after every instruction. That's why it all looks double-spaced. Use `\n\t` to start the next instruction on a new line, with a tab. I fixed that for you, but goo.gl won't shorten such a long godbolt link (and stupid SO comments don't support long URLs). Here's just the modified part: https://godbolt.org/g/weMVwW. It's pretty funny how gcc tries to auto-vectorize it or something, or fails to optimize away some of the storing into arrays, and is doing VPINSRD along with every multiply. – Peter Cordes Nov 13 '16 at 23:38
Of course I don't do r1 = r2 * 10 in the actual code. I wanted to stick to the three assembly versions given in the question, without any additional moves. The assemblies there are always with input != output register, so I also do this here (otherwise the register pressure would be different). So my goal is to write code that compiles exactly to the asm I show in the question, which in my opinion is achieved in this case. Another question is whether this is clear in my answer. – overseas Nov 14 '16 at 05:18
I changed the \r\n (I guess I started with some code where some participants worked on Windows...). The thing with gcc is quite funny, and it only happened when I added the second array with two do_ilp_step in do_unroll_step (before that it was also doing only the imul instructions). – overseas Nov 14 '16 at 05:22
  • Yeah, it was clear what you'd actually done, especially after looking at your godbolt link. This last update now fixes the answer so it doesn't start out misleading. – Peter Cordes Nov 14 '16 at 05:24
  • You have some errors in your conclusions, though: *the ADD and MOV are both executed on p0, 1, 5, or 6*: not quite. MOV doesn't need an execution port at all, and is "executed" at register-rename time. This is called "mov elimination". It's still a fused-domain uop, so you bottleneck on the issue / retire width of the out-of-order core, at 1 multiply per clock (IDK why you said 1.25; you only see that when you run out of registers and spill to memory). – Peter Cordes Nov 14 '16 at 05:28
  • *Seems like two of the three instructions only take 1/2 cycle*: Three instructions? LEA+ADD is only two instructions. And yes, that gives you a throughput of two multiplies per clock (in theory with perfect scheduling by the OOO core; in practice resource conflicts from ADDs stealing cycles on p1 or p5 reduce your throughput to one per 0.65 cycles). Like I said in my answer, simple-LEA can run on p1/p5, and ADD can run on any ALU port. – Peter Cordes Nov 14 '16 at 05:32
  • They were not up to date. Sorry for that. At some point I observed an additional move (that was removed due to register renaming I guess). – overseas Nov 14 '16 at 18:36
  • Compile-time avoiding of MOV instructions is not [register renaming](https://en.wikipedia.org/wiki/Register_renaming). If you want to call it something, it's maybe smarter *register allocation*. BTW, I don't think it's a good idea to remove the timing code from your Godbolt link. People might want to try the same code on their own CPUs. It's pretty easy to find the relevant part of the asm output, since the newlines between each group make it really obvious which part is from inline-asm. – Peter Cordes Nov 14 '16 at 18:46
Thanks for the correction. Complete code is added. It's probably not that beautiful; it's just enough to work (and it's copied from multiple files, so the includes are probably not that "well-optimized"). – overseas Nov 14 '16 at 18:56