I wanted to benchmark the time needed to do a single addition on my Skylake (i5-6500) CPU. C is low-level enough for me, so I wrote the following code:
// Initializing stuffs
int a = rand();
int b = rand();
const unsigned long loop_count = 1000000000;
unsigned int ignored; // used for __rdtscp
// Warming up whatever needs to be warmed up
for (int i = 0; i < 100000; i++) {
asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
a += b;
}
// The actual measurement
uint64_t timer = __rdtscp(&ignored);
for (unsigned long i = 0; i < loop_count; i++) {
asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
a += b;
}
timer = __rdtscp(&ignored) - timer;
printf("%.2f cycles/iteration\n", (double)timer / loop_count);
Compiling with Clang 7.0.0 -O3, I get the following assembly (for the loop only):
# %bb.2:
rdtscp
movq %rdx, %rdi
movl %ecx, 4(%rsp)
shlq $32, %rdi
orq %rax, %rdi
movl $1000000000, %eax # imm = 0x3B9ACA00
.p2align 4, 0x90
.LBB0_3: # =>This Inner Loop Header: Depth=1
#APP
#NO_APP
addl %esi, %ebx
addq $-1, %rax
jne .LBB0_3
# %bb.4:
rdtscp
And running this code outputs
0.94 cycles/iteration
(or a number pretty much always between 0.93 and 0.96)
I'm surprised that this loop can execute in less than 1 cycle/iteration, since there is a data dependency on a
that should prevent parallel execution of a += b
.
IACA
also confirms that the expected throughput is 0.96 cycles. llvm-mca
on the other hand predicts a total of 104 cycles to execute 100 iterations of the loop. (I can edit in the traces if needed; let me know)
I observe a similar behavior when I use SSE registers rather than general purpose ones.
I can imagine that the CPU is smart enough to notice that b
is constant and since addition is commutative, it could unroll the loop and optimize the additions somehow. However, I've never heard nor read anything about this. And furthermore, if this was what's going on, I'd expect better performances (ie. fewer cycles/iteration) than 0.94 cycles/iteration.
What is going on? How is this loop able to execute in less than 1 cycle per iteration?
Some background, for completeness. Ignore the remaining of the question if you're not interested in why I'm trying to benchmark a single addition.
I know that there are tools (llvm-exegesis for instance) designed to benchmark a single instruction and that I should instead of them (or just look at agner fog's docs). However, I'm actually trying to compare three different additions: one doing a single addition in a loop (the object of my question); one doing 3 additions per loop (on SSE registers, that should maximize port usage and not be limited by data dependencies), and one where the addition is implemented as a circuit in software. While the results are mostly as I expected; the 0.94 cycles/iteration for the version with a single addition in a loop left me puzzled.