intel.com lists a Skylake throughput (CPI) of 0.5 for adcx/adox.
But llvm-mca returns:
$ llvm-mca kkk -mcpu=skylake -timeline --timeline-max-iterations=10 --timeline-max-cycles=999
...
Timeline view:
0123456789 0123456789 0123456789
Index 0123456789 0123456789 0123456789 012
[0,0] DeER . . . . . . . . . . . . . adcxq %rax, %rbx
[0,1] D=eER. . . . . . . . . . . . . adoxq %rcx, %rdx
[0,2] D==eER . . . . . . . . . . . . adcxq %rsp, %rbp
[0,3] D===eER . . . . . . . . . . . . adoxq %rsi, %rdi
[0,4] D====eER . . . . . . . . . . . . adcxq %r8, %r9
[0,5] D=====eER . . . . . . . . . . . . adoxq %r10, %r11
[1,0] .D=====eER. . . . . . . . . . . . adcxq %rax, %rbx
[1,1] .D======eER . . . . . . . . . . . adoxq %rcx, %rdx
[1,2] .D=======eER . . . . . . . . . . . adcxq %rsp, %rbp
[1,3] .D========eER . . . . . . . . . . . adoxq %rsi, %rdi
[1,4] .D=========eER . . . . . . . . . . . adcxq %r8, %r9
[1,5] .D==========eER. . . . . . . . . . . adoxq %r10, %r11
which executes only one instruction per cycle, i.e. a CPI of 1. Why?
This is almost certainly a bug in llvm-mca, as the same chain with a decq/jne loop around it shows:
Index 0123456789
[0,0] DeER . . adcxq %rax, %rbx
[0,1] D=eER. . adoxq %rcx, %rdx
[0,2] D==eER . adcxq %rsp, %rbp
[0,3] D===eER . adoxq %rsi, %rdi
[0,4] .D===eER . adcxq %r8, %r9
[0,5] .D====eER . adoxq %r10, %r11
[0,6] .DeE----R . decq %r15
[0,7] .D=eE---R . jne z
[1,0] . DeE---R . adcxq %rax, %rbx
[1,1] . D=eE--R . adoxq %rcx, %rdx
[1,2] . D==eE-R . adcxq %rsp, %rbp
[1,3] . D===eER . adoxq %rsi, %rdi
[1,4] . D===eER. adcxq %r8, %r9
[1,5] . D====eER adoxq %r10, %r11
[1,6] . DeE----R decq %r15
[1,7] . D=eE---R jne z
After decq, [1,0] adcxq is claimed to execute on cycle 3, while it relies on a result that is only produced on cycle 5. adoxq can be executed early, though. This looks like the issue from another thread, as it also applies to inc in an adcq chain. The LLVM community confirmed that "We only have an EFLAGS register modeled.", and fixing that should fix both.
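To illustrate why a single modeled EFLAGS register would produce the 1 CPI seen in the first timeline, here is a toy dependency model in Python. It is only a sketch and not how llvm-mca is implemented: the function name chain_cycles and the LOOP list are made up for this example, every instruction is assumed to have a 1-cycle latency, and dispatch width and execution ports are ignored.

```python
# Toy model: an instruction starts once all of its inputs are ready and
# finishes one cycle later (assumed latency). Ports/dispatch are ignored.

# The six-instruction body from the first timeline: (mnemonic, src, dst).
LOOP = [
    ("adcx", "rax", "rbx"),
    ("adox", "rcx", "rdx"),
    ("adcx", "rsp", "rbp"),
    ("adox", "rsi", "rdi"),
    ("adcx", "r8", "r9"),
    ("adox", "r10", "r11"),
]


def chain_cycles(instrs, iterations, split_flags):
    """Return the cycle in which the last result becomes available."""
    ready = {}  # resource name -> cycle at which its latest value is ready
    for _ in range(iterations):
        for mnemonic, src, dst in instrs:
            if split_flags:
                # adcx reads/writes only CF, adox only OF: two chains.
                flag = "CF" if mnemonic == "adcx" else "OF"
            else:
                # One flags register for everything: a single chain.
                flag = "EFLAGS"
            start = max(ready.get(r, 0) for r in (src, dst, flag))
            finish = start + 1
            ready[dst] = finish
            ready[flag] = finish
    return max(ready.values())


if __name__ == "__main__":
    n = 10
    total = n * len(LOOP)
    for split in (False, True):
        cycles = chain_cycles(LOOP, n, split_flags=split)
        print(f"split_flags={split}: {cycles} cycles, CPI={cycles / total}")
```

With one flags resource every instruction waits for the previous one, so the model gives one instruction per cycle; with CF and OF tracked separately the adcx and adox chains overlap and the CPI drops to 0.5, matching the intel.com figure.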