
I have recently been learning about SIMD in assembly (x86_64), and had some unexpected results. What it comes down to is the following.

I have two programs that run through a loop a number of times. The first program contains a loop that executes 4 SIMD instructions; the second contains this exact same loop with one extra instruction. The code looks like this:

The first program:

section .bss
doublestorage: resb 8

section .text
    global _start

_start:
    mov rax, 0x0000000100000001
    mov [doublestorage], rax
    cvtpi2pd xmm1, [doublestorage]
    cvtpi2pd xmm2, [doublestorage]
    cvtpi2pd xmm3, [doublestorage]
    cvtpi2pd xmm4, [doublestorage]
    cvtpi2pd xmm5, [doublestorage]
    cvtpi2pd xmm6, [doublestorage]
    cvtpi2pd xmm7, [doublestorage]

    mov rax, (1 << 31)
loop:
    movupd xmm1, xmm3
    movupd xmm2, xmm5
    divpd xmm1, xmm2
    addpd xmm4, xmm1
    dec rax
    jnz loop

    mov rax, 60
    mov rdi, 0
    syscall

The second program:

section .bss
doublestorage: resb 8

section .text
    global _start

_start:
    mov rax, 0x0000000100000001
    mov [doublestorage], rax
    cvtpi2pd xmm1, [doublestorage]
    cvtpi2pd xmm2, [doublestorage]
    cvtpi2pd xmm3, [doublestorage]
    cvtpi2pd xmm4, [doublestorage]
    cvtpi2pd xmm5, [doublestorage]
    cvtpi2pd xmm6, [doublestorage]
    cvtpi2pd xmm7, [doublestorage]

    mov rax, (1 << 31)
loop:
    movupd xmm1, xmm3
    movupd xmm2, xmm5
    divpd xmm1, xmm2
    addpd xmm4, xmm1
    movupd xmm6, xmm7
    dec rax
    jnz loop

    mov rax, 60
    mov rdi, 0
    syscall

Now, my line of thought was the following: the second program has more instructions to execute, so it should take longer to execute. When I time both programs, though, the second one turns out to take less time to complete than the first. I ran both programs 100 times each, and the results are:

Runtime first program: mean: 5.6129 s, standard deviation: 0.0156 s
Runtime second program: mean: 5.5056 s, standard deviation: 0.0147 s
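
Assuming independent runs, the standard error of each mean is sd/sqrt(100) ≈ 0.0016 s, so the 0.107 s gap between the means is roughly 50 combined standard errors: far larger than the measurement noise.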

I conclude that the second program consistently runs faster. These results seem counterintuitive to me, so I was wondering what could be the reason for this behavior.

To be complete: I am running Ubuntu 15.10, assembling with NASM (`-f elf64`), and using an Intel Core i7-5600. I also checked the disassembly to confirm that the assembler made no changes to the code:

Objdump of the first program:

exec/instr4:     file format elf64-x86-64


Disassembly of section .text:

00000000004000b0 <.text>:
  4000b0:   48 b8 01 00 00 00 01    movabs $0x100000001,%rax
  4000b7:   00 00 00 
  4000ba:   48 89 04 25 28 01 60    mov    %rax,0x600128
  4000c1:   00 
  4000c2:   66 0f 2a 0c 25 28 01    cvtpi2pd 0x600128,%xmm1
  4000c9:   60 00 
  4000cb:   66 0f 2a 14 25 28 01    cvtpi2pd 0x600128,%xmm2
  4000d2:   60 00 
  4000d4:   66 0f 2a 1c 25 28 01    cvtpi2pd 0x600128,%xmm3
  4000db:   60 00 
  4000dd:   66 0f 2a 24 25 28 01    cvtpi2pd 0x600128,%xmm4
  4000e4:   60 00 
  4000e6:   66 0f 2a 2c 25 28 01    cvtpi2pd 0x600128,%xmm5
  4000ed:   60 00 
  4000ef:   66 0f 2a 34 25 28 01    cvtpi2pd 0x600128,%xmm6
  4000f6:   60 00 
  4000f8:   66 0f 2a 3c 25 28 01    cvtpi2pd 0x600128,%xmm7
  4000ff:   60 00 
  400101:   b8 00 00 00 80          mov    $0x80000000,%eax
  400106:   66 0f 10 cb             movupd %xmm3,%xmm1
  40010a:   66 0f 10 d5             movupd %xmm5,%xmm2
  40010e:   66 0f 5e ca             divpd  %xmm2,%xmm1
  400112:   66 0f 58 e1             addpd  %xmm1,%xmm4
  400116:   48 ff c8                dec    %rax
  400119:   75 eb                   jne    0x400106
  40011b:   b8 3c 00 00 00          mov    $0x3c,%eax
  400120:   bf 00 00 00 00          mov    $0x0,%edi
  400125:   0f 05                   syscall 

Objdump of the second program:

exec/instr5:     file format elf64-x86-64


Disassembly of section .text:

00000000004000b0 <.text>:
  4000b0:   48 b8 01 00 00 00 01    movabs $0x100000001,%rax
  4000b7:   00 00 00 
  4000ba:   48 89 04 25 2c 01 60    mov    %rax,0x60012c
  4000c1:   00 
  4000c2:   66 0f 2a 0c 25 2c 01    cvtpi2pd 0x60012c,%xmm1
  4000c9:   60 00 
  4000cb:   66 0f 2a 14 25 2c 01    cvtpi2pd 0x60012c,%xmm2
  4000d2:   60 00 
  4000d4:   66 0f 2a 1c 25 2c 01    cvtpi2pd 0x60012c,%xmm3
  4000db:   60 00 
  4000dd:   66 0f 2a 24 25 2c 01    cvtpi2pd 0x60012c,%xmm4
  4000e4:   60 00 
  4000e6:   66 0f 2a 2c 25 2c 01    cvtpi2pd 0x60012c,%xmm5
  4000ed:   60 00 
  4000ef:   66 0f 2a 34 25 2c 01    cvtpi2pd 0x60012c,%xmm6
  4000f6:   60 00 
  4000f8:   66 0f 2a 3c 25 2c 01    cvtpi2pd 0x60012c,%xmm7
  4000ff:   60 00 
  400101:   b8 00 00 00 80          mov    $0x80000000,%eax
  400106:   66 0f 10 cb             movupd %xmm3,%xmm1
  40010a:   66 0f 10 d5             movupd %xmm5,%xmm2
  40010e:   66 0f 5e ca             divpd  %xmm2,%xmm1
  400112:   66 0f 58 e1             addpd  %xmm1,%xmm4
  400116:   66 0f 10 f7             movupd %xmm7,%xmm6
  40011a:   48 ff c8                dec    %rax
  40011d:   75 e7                   jne    0x400106
  40011f:   b8 3c 00 00 00          mov    $0x3c,%eax
  400124:   bf 00 00 00 00          mov    $0x0,%edi
  400129:   0f 05                   syscall 
arriopolis
  • There can be a data dependency. The time is bound by the div unit, but in the first loop the divisor is constant. Some constants (subnormals) may execute much slower. – Aki Suihkonen Apr 15 '16 at 16:42
  • You'd better have a statistically significant sample if you want to discuss 0.6% performance difference. – EOF Apr 15 '16 at 16:44
  • The additional instruction can be performed at the same time as everything else in the loop — I'd dare expect it to add zero cost simply because now the processor gets to use an arithmetic unit for a cycle that it otherwise wasn't using it for. So same speed isn't surprising; faster is a surprise unless you've just happened somehow to make the processor sufficiently confident about another out-of-order opportunity where before it wasn't certain. Those decisions have a time cost ceiling so are often broad phase only. – Tommy Apr 15 '16 at 16:48
  • @Aki Suihkonen That doesn't seem to be the whole answer. Even if I initialize the xmm registers to the same values and replace the instruction I add by one that doesn't influence the others, I obtain a similar behavior. I'll update the question elaborating. – arriopolis Apr 15 '16 at 16:55
  • @EOF You're right, I'll take a statistical sample and update the results in the question. – arriopolis Apr 15 '16 at 16:58
  • @Tommy It turns out that on my system the second program does execute faster indeed. – arriopolis Apr 15 '16 at 17:42
  • Could you `objdump` the executables? I wonder if there could be an issue with macro-op fusion of the `dec; jnz`-pair if they are in different 16-byte aligned blocks. – EOF Apr 15 '16 at 17:55
  • @EOF Alright, I'll add them to the question too. – arriopolis Apr 15 '16 at 17:58
  • Well, there goes that idea. I don't see any obvious dependency/out-of-order explanations for this. There are a few esoteric issues like micro-op caches or loop-buffers that *could* explain this, but that's rather involved. – EOF Apr 15 '16 at 18:20
  • @EOF Ok, well, thanks for having a look anyway. I am setting all xmm registers to 1 though, right? That way I tried to avoid dividing by 0. – arriopolis Apr 15 '16 at 18:22
  • Yeah, I was confused about the contents of the `doublestorage`. Anyway, AFAIK there's a tool Intel provides to analyze the cycle-exact execution of such code snippets. – EOF Apr 15 '16 at 18:24
  • @EOF That looks like a good option. I'll have a look into it, thanks! – arriopolis Apr 15 '16 at 19:00
  • IACA only supports up to Haswell. It doesn't say anything other than 14c per iteration for both loops, bottlenecking on divider throughput. (Agner Fog's measurements for `divpd` are one per 8-14c throughput on Haswell, one per 8c on Broadwell .) There was [a recent question about broadwell throughput](http://stackoverflow.com/questions/34309707/significant-fma-performance-anomaly-experienced-in-the-intel-broadwell-processor), but that's about saturating the frontend. Possibly one of the `movupd` insns isn't being eliminated, and is stealing p0 from `divpd` for a cycle sometimes? – Peter Cordes Apr 16 '16 at 06:09

1 Answer


There's no such thing as an "i7 5600". I assume you mean i7 5600U, which is a low-power (15W TDP) CPU with 2.6GHz base / 3.2GHz turbo.

Can you double-check that this is reproducible? Make sure CPU clock speed stays constant across both tests, because your low-power CPU may not be able to run at full turbo with the divide unit busy all the time.

It may also be useful to measure with perf counters (e.g. `perf stat ./a.out`), counting core clock cycles rather than "reference" cycles, so you count cycles at whatever speed the core is actually running.
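For example (a hypothetical invocation; exact event names vary by perf version), `perf stat -e cycles:u,instructions:u ./instr4` would count user-space core cycles and instructions, making the cycles-per-iteration comparison independent of frequency scaling.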


IACA only supports up to Haswell. It doesn't say anything other than 14c per iteration for both loops, bottlenecking on divider throughput. (Agner Fog's measurements for divpd are one per 8-14c throughput on Haswell, one per 8c on Broadwell.)

There was a recent question about Broadwell throughput, but that's about saturating the frontend.


This loop should be purely bottlenecked on divpd throughput (one per 8c on Broadwell). If the effect is real, the only explanation I've come up with is that one of the movupd insns isn't always being eliminated, and is stealing p0 from divpd for a cycle sometimes.

The three unfused-domain uops in the loop all run on different ports, so there's no way they can delay each other. (divpd on p0, addpd on p1, and the predicted-taken fused dec/jnz on p6).

Actually, even that theory doesn't hold water. A non-eliminated movaps xmm,xmm uses port 5 on Broadwell. I assume the odd choice of movupd xmm,xmm also decodes to a port5 uop. (Agner Fog doesn't even list an entry for the reg-reg form of movups/movupd, because everyone always uses movaps. Or movapd if they like matching the insn type to the data, even though it's a byte longer and no existing uarch cares about s vs d for bypass delays: only movaps for float/double and movdqa for integer.)
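
For reference, a quick sketch of the three equivalent reg-reg move encodings (the `movupd` bytes match the objdump above; the other two are the standard encodings, shown here just for the size comparison):

    movaps xmm1, xmm3   ; 0f 28 cb      (3 bytes)
    movapd xmm1, xmm3   ; 66 0f 28 cb   (4 bytes)
    movupd xmm1, xmm3   ; 66 0f 10 cb   (4 bytes)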


Interestingly, my 2.4GHz E6600 (Conroe/Merom microarch) runs your loop in 4.5s. Agner Fog's tables list divpd on Merom as one per 5-31c; 1.0/1.0 presumably happens in 5c. Sandybridge has a significantly slower best-case divide than Nehalem. It's only with Skylake that the best-case throughput comes back down to Merom's speed (fixed at one per 4c for 128b divpd).


BTW, here's a version of your code that uses a couple more normal ways to set up FP data in regs:

    default REL              ;  use RIP-relative for access to globals by default, so you don't have to write `[rel my_global]`

section .rodata
    ; ALIGN 16               ; only desirable if we're doing a 128b load instead of a 64b broadcast-load
one_dp: dq   1.0

section .text

global _start
_start:
    mov      rax, 0x0000000100000001
    mov      [rsp-16], rax     ; if you're going to store/reload, use the stack for scratch space, not a static location that will probably cache-miss.
    cvtpi2pd xmm1, [rsp-16]    ; this is the form with xmm, mm/m64 args.   Interestingly, for the memory operand form, this should perform the same but saves a byte in the encoding.
    cvtdq2pd xmm8, [rsp-16]    ; this is the "normal" insn, with  xmm, xmm/m64 args.
    movq     xmm9, rax
    cvtdq2pd xmm9, xmm9
    ; Fun fact: 64bit int <-> FP is only available for scalar until AVX-512 introduces packed conversions for qword integer vectors.

    ;mov      eax, 1      ; still 1 from earlier
    cvtsi2sd xmm2, eax
    unpcklpd xmm2, xmm2   ; broadcast

    pcmpeqw  xmm3,xmm3          ; generate the constant on the fly, from Agner Fog's asm guide
    psllq    xmm3, 54
    psrlq    xmm3, 2            ; (double)1.0 in both halves.

    movaps   xmm4, xmm3         ; duplicate the data instead of re-doing the conversion.  Shorter and cheaper.
    movaps   xmm5, xmm3
    movaps   xmm6, xmm3

    ;movaps   xmm7, [ones]      ; load 128b constant
    movddup   xmm7, [one_dp]        ; broadcast-load 

    mov eax, (1 << 31)          ; 1<<31 fits in a 32bit reg just fine.
;IACA_start
.loop:
    ;movupd   xmm1, xmm3
    ;movupd   xmm2, xmm5
    movaps   xmm1, xmm3      ; prob. no perf diff, but weird to see movupd used for reg-reg moves
    movaps   xmm2, xmm5
    divpd    xmm1, xmm2
    addpd    xmm4, xmm1
    ;movaps  xmm6, xmm7
    dec      eax
    jnz    .loop
;IACA_end

    mov eax, 60
    xor edi,edi
    syscall                     ; exit(0)

For IACA, I used:

%macro  IACA_start 0
     mov ebx, 111
     db 0x64, 0x67, 0x90
%endmacro
%macro  IACA_end 0
     mov ebx, 222
     db 0x64, 0x67, 0x90
%endmacro
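
To use them, the commented `;IACA_start` / `;IACA_end` markers in the listing above would become macro invocations, roughly like this (a sketch; the marked binary is fed to the analyzer rather than executed):

    IACA_start
.loop:
    movaps   xmm1, xmm3
    movaps   xmm2, xmm5
    divpd    xmm1, xmm2
    addpd    xmm4, xmm1
    dec      eax
    jnz      .loop
    IACA_end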

Did you try stripping the loop down even more?

.loop:
    movapd   xmm1, xmm3       ; replace the only operand that div writes
    divpd    xmm1, xmm2
    addpd    xmm4, xmm1
    dec      eax
    jnz    .loop

Then the loop is only 4 fused-domain uops. Should make zero difference, since divpd throughput should still be the only bottleneck.

Or with AVX:

.loop:
    vdivpd    xmm0, xmm1, xmm2
    vaddpd    xmm4, xmm4, xmm0
    dec       eax
    jnz       .loop
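
With AVX's three-operand, non-destructive destination, the register-copy disappears entirely, so the loop body is nothing but the divide, the add, and the loop branch.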
Peter Cordes
  • You're right, I'm using the i7-5600U. Secondly, I did another simulation today and alternated the first and second program in an attempt to reduce any unwanted effects. The results were similar: `5.5282 +/- 0.0094` vs. `5.4285 +/- 0.0092`, indicating that the second program is indeed faster. – arriopolis Apr 16 '16 at 12:37
  • Stripping the loop down by removing the `movapd xmm2, xmm5` instruction gave more logical results: it caused the second program to run a little bit slower than the first, which sounds reasonable. Furthermore, changing to `movapd` or `movaps` does not seem to influence the timing significantly. – arriopolis Apr 16 '16 at 12:49
  • @arriopolis: saving a byte with movaps instead of movapd will make zero difference, since the loop executes from the loop buffer. It doesn't have to be decoded every iteration. The code-size benefit is just for overall code size / L1 cache size in the big picture of the program you might use this code in. Also, removing `movapd xmm2,xmm5` shouldn't have any effect on the timing either. The expected result is that the loop runs at one iteration per 8 cycles, up to the point where there are enough other insns to be a bottleneck (which is nowhere near happening here). – Peter Cordes Apr 16 '16 at 12:54
  • @arriopolis: With [`ocperf.py`](http://halobates.de/blog/p/245), have a look at the perf counters for `uops_dispatched.port0` or whatever the exact name of that counter is, and see if there are more uops being issued to p0 than just the divide. (AFK several hours) – Peter Cordes Apr 16 '16 at 12:58
  • Well it looks like removing the `movapd xmm2, xmm5` actually did influence the timing, as with the extra move, the second program executes faster, and without it, the first program executes faster. – arriopolis Apr 16 '16 at 13:08