x86: Long loop-carried dependency chain. Why 13 cycles?

Question

I modified the code from a previous experiment (Agner Fog's Optimizing Assembly, example 12.10a) to make it more dependent:

movsd xmm2, [x] 
movsd xmm1, [one] 
xorps xmm0, xmm0  
mov eax, coeff    

L1:
    movsd xmm3, [eax]
    mulsd xmm3, xmm1
    mulsd xmm1, xmm2
    addsd xmm1, xmm3
    add   eax, 8
    cmp eax, coeff_end
    jb L1

And now it takes ~13 cycles per iteration, but I have no idea why so much. Please help me understand.

(update) I'm sorry. Yes, definetely @Peter Cordes is right- it takes 9 cycles per iteration in fact. The misunderstanding is caused by myself. I missed two similar pieces of codes ( instructions swapped), the 13-cycles code is here:

movsd xmm2, [x] 
movsd xmm1, [one] 
xorps xmm0, xmm0  
mov eax, coeff    

L1:
    movsd xmm3, [eax]

    mulsd xmm1, xmm2
    mulsd xmm3, xmm1      
    addsd xmm1, xmm3
    add   eax, 8
    cmp eax, coeff_end
    jb L1

In the loop, `mulsd xmm3,xmm1` cmt says `xmm1` is `x^i` but the `addsd` cmt implies `xmm1` is a sum? Also, you're only doing one double per iteration. Any way to unroll the loop to use the full reg width and process `c[i]` _and_ `c[i+1]` is a single iteration? Are you memory bound? At a minimum, you might fetch both words with a single inst and then use LSW and the MSW from the reg. The cmt on the `mulsd xmm1,xmm2` doesn't seem to match usage. And, `xmm1` won't stay as `x^i` because of the addsd. Does addsd need new dst reg (e.g. xmm4)? Maybe, you could post a working C function? — Craig Estey, May 20 '16 at 20:24
I believe you want your `addsd` to be `addsd xmm0,xmm3`. Then, it begins to make sense. It breaks a dependency. And, I do believe you can cut the number of loop iterations in half by using `movapd`, `addpd`, `mulpd`. That is, each reg has both `c[i], c[i+1]` in the halves. At the end, take each half of `xmm0` and sum them to get the total sum for each even/odd sum. You'll also need `add eax,16` — Craig Estey, May 20 '16 at 20:59
It's obvious why this should take at least 9 cycles/iter on Ivy, but I'm not sure where the rest comes from — harold, May 20 '16 at 21:12
@CraigEstey: The comments are the same as in the loop from a previous question. AFAICT, the OP changed the loop to experiment with dependencies, not to calculate anything useful, and neglected to remove the comments. — Peter Cordes, May 20 '16 at 21:15

score 3 · Accepted Answer · answered May 20 '16 at 21:14

It runs at exactly one iteration per 9c for me, on a Core2 E6600, which is expected:

      movsd xmm3, [eax] ; independent, depends only on eax

A:    mulsd xmm3, xmm1  ; 5c: depends on xmm1:C from last iteration
B:    mulsd xmm1, xmm2  ; 5c: depends on xmm1:C from last iteration
C:    addsd xmm1, xmm3  ; 3c: depends on xmm1:B from THIS iteration (and xmm3:A from this iteration)

When xmm1:C is ready from iteration i, the next iteration can start calculating:

A: producing xmm3:A in 5c
B: producing xmm1:B in 5c (but there's a resource conflict; these multiplies can't both start in the same cycle in Core2 or IvyBridge, only Haswell and later)

Regardless of which one runs first, both have to finish before C can run. So the loop-carried dependency chain is 5 + 3 cycles, +1c for the resource conflict that stops both multiplies from starting in the same cycle.

Test code that runs at the expected speed:

This slows down to one iteration per ~11c when the array is 8B * 128 * 1024. If you're testing with an even bigger array instead of using a repeat-loop around what you posted, then that's why you're seeing a higher latency.

If a load arrives late, there's no way for the CPU to "catch up", since it delays the loop-carried dependency chain. If the load was only needed in a dependency chain that forked off from the loop-carried chain, then the pipeline could absorb an occasional slow load more easily. So, some loops can be more sensitive to memory delays than others.

        default REL
%macro  IACA_start 0
     mov ebx, 111
     db 0x64, 0x67, 0x90
%endmacro
%macro  IACA_end 0
     mov ebx, 222
     db 0x64, 0x67, 0x90
%endmacro

global _start
_start:
        movsd   xmm2, [x]
        movsd   xmm1, [one]
        xorps   xmm0, xmm0
        mov     ecx, 10000

outer_loop:
        mov     eax, coeff
IACA_start                      ; outside the loop
ALIGN 32                        ; this matters on Core2, .78 insn per cycle vs. 0.63 without
L1:
        movsd   xmm3, [eax]
        mulsd   xmm3, xmm1
        mulsd   xmm1, xmm2
        addsd   xmm1, xmm3
        add     eax, 8
        cmp     eax, coeff_end
        jb      L1
IACA_end
        dec     ecx
        jnz     outer_loop

        ;mov    eax, 1
        ;int    0x80            ; exit() for 32bit code
        xor     edi, edi
        mov     eax, 231        ;  exit_group(0).  __NR_exit = 60.
        syscall


        section .data
x:
one:    dq 1.0

        section .bss
coeff:  resq 24*1024        ; 6 * L1 size.  Doesn't run any faster when it fits in L1 (resb)
coeff_end:

Experimental test

$ asm-link interiteration-test.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 interiteration-test.asm
+ ld -o interiteration-test interiteration-test.o

$ perf stat ./interiteration-test

 Performance counter stats for './interiteration-test':
    928.543744      task-clock (msec)         #    0.995 CPUs utilized          
           152      context-switches          #    0.164 K/sec                  
             1      cpu-migrations            #    0.001 K/sec                  
            52      page-faults               #    0.056 K/sec                  
 2,222,536,634      cycles                    #    2.394 GHz                      (50.14%)
 <not supported>      stalled-cycles-frontend  
 <not supported>      stalled-cycles-backend   
 1,723,575,954      instructions              #    0.78  insns per cycle          (75.06%)
   246,414,304      branches                  #  265.377 M/sec                    (75.16%)
        51,483      branch-misses             #    0.02% of all branches          (74.74%)

   0.933372495 seconds time elapsed

Each branch / every 7 instructions is one iteration of the inner loop.

$ bc -l
bc 1.06.95
1723575954 / 7
246225136.28571428571428571428
# ~= number of branches: good
2222536634 / .
9.026
# cycles per iteration

IACA agrees: 9c per iteration on IvB

(not counting the nops from ALIGN):

$ iaca.sh -arch IVB interiteration-test
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - interiteration-test
Binary Format - 64Bit
Architecture  - IVB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 9.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 2.0    0.0  | 1.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 2.0  |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | movsd xmm3, qword ptr [eax]
|   1    | 1.0       |     |           |           |     |     |    | mulsd xmm3, xmm1
|   1    | 1.0       |     |           |           |     |     | CP | mulsd xmm1, xmm2
|   1    |           | 1.0 |           |           |     |     | CP | addsd xmm1, xmm3
|   1    |           |     |           |           |     | 1.0 |    | add eax, 0x8
|   1    |           |     |           |           |     | 1.0 |    | cmp eax, 0x63011c
|   0F   |           |     |           |           |     |     |    | jb 0xffffffffffffffe7
Total Num Of Uops: 6

score 2 · Answer 2 · answered May 21 '16 at 22:38

With the addsd change suggested in my comments above, --> addsd xmm0,xmm3, this can be coded to use the full width registers and the performance is twice as fast.

Loosely:

For the initial value of ones, it needs to be:

double ones[2] = { 1.0, x }

And we need to replace x with x2:

double x2[2] = { x * x, x * x }

If there is an odd number of coefficients, pad it with a zero to produce an even number of them.

And, changing the pointer increment to 16.

Here are the test results I got. I did a number of trials and took the ones that had the best time and elongated the time by doing 100 iterations. std is the C version, dbl is your version, and qed is the "wide" version:

R=1463870188
C=100
T=100
I=100
x: 3.467957099973322e+00 3.467957099973322e+00
one: 1.000000000000000e+00 3.467957099973322e+00
x2: 1.202672644725538e+01 1.202672644725538e+01
std: 2.803772098439484e+56 (ELAP: 0.000019312)
dbl: 2.803772098439484e+56 (ELAP: 0.000019312)
qed: 2.803772098439492e+56 (ELAP: 0.000009060)
rtn_loop: 2.179378907910304e+55 2.585834207648461e+56
rtn_shuf: 2.585834207648461e+56 2.179378907910304e+55
rtn_add: 2.803772098439492e+56 2.585834207648461e+56

This was done on an i7 920 @ 2.67 GHz.

I think if you take the elapsed numbers and convert them, you'll see that your version is faster than you think.

I apologize, in advance, for switching to AT&T syntax as I had difficulty getting the assembler to work the other way. Again, sorry. Also, I'm using linux, so I used the rdi rsi registers to pass the coefficient pointers. If you're on windows, the ABI is different and you'll have to adjust for that.

I did a C version and diassembled it. It was virtually identical to your code except that it rearranged the non-xmm instructions a bit, which I've added below.

I believe I posted all the files, so you could conceivably run this on your system if you wished.

Here's the original code:

# xmmloop/dbl.s -- implement using single double

    .globl  dbl
# dbl -- compute result using single double
#
# arguments:
#   rdi -- pointer to coeff vector
#   rsi -- pointer to coeff vector end
dbl:
    movsd   x(%rip),%xmm2           # get x value
    movsd   one(%rip),%xmm1         # get ones
    xorps   %xmm0,%xmm0             # sum = 0

dbl_loop:
    movsd   (%rdi),%xmm3            # c[i]

    add     $8,%rdi                 # increment to next vector element
    cmp     %rsi,%rdi               # done yet?

    mulsd   %xmm1,%xmm3             # c[i]*x^i
    mulsd   %xmm2,%xmm1             # x^(i+1)
    addsd   %xmm3,%xmm0             # sum += c[i]*x^i

    jb      dbl_loop                # no, loop

    retq

Here's the code changed to use the movapd et. al:

# xmmloop/qed.s -- implement using single double

    .globl  qed
# qed -- compute result using single double
#
# arguments:
#   rdi -- pointer to coeff vector
#   rsi -- pointer to coeff vector end
qed:
    movapd  x2(%rip),%xmm2          # get x^2 value
    movapd  one(%rip),%xmm1         # get [1,x]
    xorpd   %xmm4,%xmm4             # sum = 0

qed_loop:
    movapd  (%rdi),%xmm3            # c[i]

    add     $16,%rdi                # increment to next coefficient
    cmp     %rsi,%rdi               # done yet?

    mulpd   %xmm1,%xmm3             # c[i]*x^i
    mulpd   %xmm2,%xmm1             # x^(i+2)
    addpd   %xmm3,%xmm4             # sum += c[i]*x^i

    jb      qed_loop                # no, loop

    movapd  %xmm4,rtn_loop(%rip)    # save intermediate DEBUG
    movapd  %xmm4,%xmm0             # get lower sum
    shufpd  $1,%xmm4,%xmm4          # get upper value into lower half
    movapd  %xmm4,rtn_shuf(%rip)    # save intermediate DEBUG
    addsd   %xmm4,%xmm0             # add upper sum to lower
    movapd  %xmm0,rtn_add(%rip)     # save intermediate DEBUG

    retq

Here's a C version of the code:

// xmmloop/std -- compute result using C code

#include <xmmloop.h>

// std -- compute result using C
double
std(const double *cur,const double *ep)
{
    double xt;
    double xn;
    double ci;
    double sum;

    xt = x[0];
    xn = one[0];
    sum = 0;

    for (;  cur < ep;  ++cur) {
        ci = *cur;                  // get c[i]
        ci *= xn;                   // c[i]*x^i
        xn *= xt;                   // x^(i+1)
        sum += ci;                  // sum += c[i]*x^i
    }

    return sum;
}

Here's the test program I used:

// xmmloop/xmmloop -- test program

#define _XMMLOOP_GLO_
#include <xmmloop.h>

// tvget -- get high precision time
double
tvget(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME,&ts);

    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

// timeit -- get best time
void
timeit(fnc_p proc,double *cofptr,double *cofend,const char *tag)
{
    double tvbest;
    double tvbeg;
    double tvdif;
    double sum;

    sum = 0;

    tvbest = 1e9;

    for (int trycnt = 1;  trycnt <= opt_T;  ++trycnt) {
        tvbeg = tvget();

        for (int iter = 1;  iter <= opt_I;  ++iter)
            sum = proc(cofptr,cofend);

        tvdif = tvget();
        tvdif -= tvbeg;

        if (tvdif < tvbest)
            tvbest = tvdif;
    }

    printf("%s: %.15e (ELAP: %.9f)\n",tag,sum,tvbest);
}

// main -- main program
int
main(int argc,char **argv)
{
    char *cp;
    double *cofptr;
    double *cofend;
    double *cur;
    double val;
    long rseed;
    int cnt;

    --argc;
    ++argv;

    rseed = 0;
    cnt = 0;

    for (;  argc > 0;  --argc, ++argv) {
        cp = *argv;
        if (*cp != '-')
            break;

        switch (cp[1]) {
        case 'C':
            cp += 2;
            cnt = strtol(cp,&cp,10);
            break;

        case 'R':
            cp += 2;
            rseed = strtol(cp,&cp,10);
            break;

        case 'T':
            cp += 2;
            opt_T = (*cp != 0) ? strtol(cp,&cp,10) : 1;
            break;

        case 'I':
            cp += 2;
            opt_I = (*cp != 0) ? strtol(cp,&cp,10) : 1;
            break;
        }
    }

    if (rseed == 0)
        rseed = time(NULL);
    srand48(rseed);
    printf("R=%ld\n",rseed);

    if (cnt == 0)
        cnt = 100;
    if (cnt & 1)
        ++cnt;
    printf("C=%d\n",cnt);

    if (opt_T == 0)
        opt_T = 100;
    printf("T=%d\n",opt_T);

    if (opt_I == 0)
        opt_I = 100;
    printf("I=%d\n",opt_I);

    cofptr = malloc(sizeof(double) * cnt);
    cofend = &cofptr[cnt];

    val = drand48();
    for (;  val < 3;  val += 1.0);

    x[0] = val;
    x[1] = val;
    DMP(x);

    one[0] = 1.0;
    one[1] = val;
    DMP(one);

    val *= val;
    x2[0] = val;
    x2[1] = val;
    DMP(x2);

    for (cur = cofptr;  cur < cofend;  ++cur) {
        val = drand48();
        val *= 1e3;
        *cur = val;
    }

    timeit(std,cofptr,cofend,"std");
    timeit(dbl,cofptr,cofend,"dbl");
    timeit(qed,cofptr,cofend,"qed");
    DMP(rtn_loop);
    DMP(rtn_shuf);
    DMP(rtn_add);

    return 0;
}

And the header file:

// xmmloop/xmmloop.h -- common control

#ifndef _xmmloop_xmmloop_h_
#define _xmmloop_xmmloop_h_

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef _XMMLOOP_GLO_
#define EXTRN_XMMLOOP       /**/
#else
#define EXTRN_XMMLOOP       extern
#endif

#define XMMALIGN            __attribute__((aligned(16)))

EXTRN_XMMLOOP int opt_T;
EXTRN_XMMLOOP int opt_I;

EXTRN_XMMLOOP double x[2] XMMALIGN;
EXTRN_XMMLOOP double x2[2] XMMALIGN;
EXTRN_XMMLOOP double one[2] XMMALIGN;

EXTRN_XMMLOOP double rtn_loop[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_shuf[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_add[2] XMMALIGN;

#define DMP(_sym) \
    printf(#_sym ": %.15e %.15e\n",_sym[0],_sym[1]);

typedef double (*fnc_p)(const double *cofptr,const double *cofend);
double std(const double *cofptr,const double *cofend);
double dbl(const double *cofptr,const double *cofend);
double qed(const double *cofptr,const double *cofend);

#endif

Most distros package NASM and YASM, or you can use GNU as's `.intel_syntax noprefix` directive for a more MASM-like flavour of Intel syntax (same as used by `objdump -Mintel`, or `gcc -S -masm=intel`, with `dword ptr` and `offset symbol`). Noprefix refers to the `%` decorations on register names. — Peter Cordes, May 22 '16 at 01:45
@PeterCordes I tried `yasm` [and looked for `masm`(?)], but did not know about [or had forgotten about] `nasm`. I also tried `as` with the command line option `-msyntax=intel`. This was all done against original code/syntax After that, it was late, and I just punted. So, thanks. I still have the original file, so I'll have to find the combo that works, relative to what I have on my system. — Craig Estey, May 22 '16 at 02:27
I prefer `yasm`. It's almost 100% nasm compatible, as far as syntax and directives. It defaults to making nice `nop`s which don't clog the disassembly output (or slow the CPU down if they ever run). Anyway, whatever works if you're comfortable with AT&T syntax. — Peter Cordes, May 22 '16 at 02:37
@PeterCordes I tried `yasm` first [with varying options], but couldn't get it to work. IIRC, I had used it [successfully] before. But, then, I realized I wasn't comfortable with the syntax, and since this was "heavy lifting" with xmm, I decided to switch over. If it had been non-xmm, I probably would have been okay in the intel syntax, but I didn't want to take two risks at the same time. I think I'm going to write a wrapper script that does a quick scan of the file for syntax, then looks for the correct _installed_ assembler, with fallbacks as needed — Craig Estey, May 22 '16 at 02:45

x86: Long loop-carried dependency chain. Why 13 cycles?

2 Answers2

Test code that runs at the expected speed:

Experimental test

IACA agrees: 9c per iteration on IvB