
I wanted to benchmark the time needed to do a single addition on my Skylake (i5-6500) CPU. C is low-level enough for me, so I wrote the following code:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h> // for __rdtscp

int main(void) {
    // Initializing stuff
    int a = rand();
    int b = rand();
    const unsigned long loop_count = 1000000000;
    unsigned int ignored; // used for __rdtscp

    // Warming up whatever needs to be warmed up
    for (int i = 0; i < 100000; i++) {
        asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
        a += b;
    }

    // The actual measurement
    uint64_t timer = __rdtscp(&ignored);
    for (unsigned long i = 0; i < loop_count; i++) {
        asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
        a += b;
    }
    timer = __rdtscp(&ignored) - timer;

    printf("%.2f cycles/iteration\n", (double)timer / loop_count);
    return 0;
}

Compiling with Clang 7.0.0 -O3, I get the following assembly (for the loop only):

# %bb.2:
    rdtscp
    movq    %rdx, %rdi
    movl    %ecx, 4(%rsp)
    shlq    $32, %rdi
    orq %rax, %rdi
    movl    $1000000000, %eax       # imm = 0x3B9ACA00
    .p2align    4, 0x90
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
    #APP
    #NO_APP
    addl    %esi, %ebx
    addq    $-1, %rax
    jne .LBB0_3
# %bb.4:
    rdtscp

And running this code outputs

0.94 cycles/iteration

(or a number pretty much always between 0.93 and 0.96)

I'm surprised that this loop can execute in less than 1 cycle/iteration, since there is a loop-carried data dependency on `a` that should prevent parallel execution of `a += b`.

IACA also confirms that the expected throughput is 0.96 cycles per iteration. llvm-mca, on the other hand, predicts a total of 104 cycles to execute 100 iterations of the loop. (I can edit in the traces if needed; let me know.)

I observe a similar behavior when I use SSE registers rather than general purpose ones.

I can imagine that the CPU is smart enough to notice that `b` is constant and, since addition is commutative, it could unroll the loop and optimize the additions somehow. However, I've never heard or read anything about this. And furthermore, if that were what's going on, I'd expect better performance (i.e., fewer cycles/iteration) than 0.94 cycles/iteration.

What is going on? How is this loop able to execute in less than 1 cycle per iteration?


Some background, for completeness. Ignore the rest of the question if you're not interested in why I'm trying to benchmark a single addition.

I know that there are tools (llvm-exegesis, for instance) designed to benchmark a single instruction, and that I should use them instead (or just look at Agner Fog's docs). However, I'm actually trying to compare three different additions: one doing a single addition in a loop (the subject of my question), one doing 3 additions per loop (on SSE registers, which should maximize port usage and not be limited by data dependencies), and one where the addition is implemented as a circuit in software. While the results are mostly as I expected, the 0.94 cycles/iteration for the version with a single addition in a loop left me puzzled.
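For reference, here is a simplified sketch of what the 3-additions-per-loop SSE variant looks like (not the exact code; names and constants are illustrative):

#include <stdlib.h>
#include <immintrin.h>

void sse_add_benchmark(void) {
    const unsigned long loop_count = 1000000000;
    __m128i acc0 = _mm_set1_epi32(rand());
    __m128i acc1 = _mm_set1_epi32(rand());
    __m128i acc2 = _mm_set1_epi32(rand());
    __m128i b    = _mm_set1_epi32(rand());

    for (unsigned long i = 0; i < loop_count; i++) {
        // Three independent dependency chains, so the adds don't wait on each other
        asm volatile("" : "+x" (acc0), "+x" (acc1), "+x" (acc2)); // same anti-optimization trick as above
        acc0 = _mm_add_epi32(acc0, b);
        acc1 = _mm_add_epi32(acc1, b);
        acc2 = _mm_add_epi32(acc2, b);
    }
}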

  • What's the TSC frequency on your chip? Run `dmesg | grep 'tsc'` to find out. What's the core frequency? Is the core frequency fixed? – Hadi Brais Jan 17 '20 at 14:21
  • @HadiBrais Good catch. Fixing the CPU frequency solved the issue. I thought that setting cpufreq governor to "performance" was enough. But I've set the frequency to 3.2GHz (`cpupower frequency-set -f 3.2G`), and now I get 1.00 cycles/iteration, as expected. Want to post that as an answer? ;) – Dada Jan 17 '20 at 14:29
  • Could be closed as a duplicate of [How to get the CPU cycle count in x86_64 from C++?](https://stackoverflow.com/a/51907627) where my answer explains that RDTSC counts in reference cycles. There might be a more specific question where the question is more like yours, and the answer focuses on just that. This is very much a well-known fact about how RDTSC works, though, so it doesn't seem worth a *long* answer, but maybe a short one with links if someone wants to write one. – Peter Cordes Jan 17 '20 at 18:08

1 Answer


The core frequency and the TSC frequency can be different. Your loop is expected to run at 1 core cycle per iteration. If the core frequency happened to be twice the TSC frequency for the duration of the loop, the measured throughput would be 0.5 TSC cycles per iteration, which is equivalent to 1 core cycle per iteration.
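To put numbers on it (the TSC on an i5-6500 ticks at roughly the nominal 3.2 GHz; the ~3.4 GHz figure below is only a guess at the average turbo clock during the run):

core cycles/iteration = TSC cycles/iteration × (f_core / f_TSC) ≈ 0.94 × (3.4 GHz / 3.2 GHz) ≈ 1.00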

In your case, it appears that the average core frequency was slightly higher than the TSC frequency. If you don't want to take dynamic frequency scaling into account when doing experiments, it'd be easier to just fix the core frequency to be equal to the TSC frequency so that you don't have to convert the numbers. Otherwise, you'd have to measure the average core frequency as well.

On processors that support per-core frequency scaling, you have to either fix the frequency on all the cores or pin the experiments to a single core with fixed frequency. Alternatively, instead of measuring in TSC cycles, you can use a tool like perf to easily measure the time in core cycles or seconds.
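For example, something along the lines of `perf stat -e cycles,instructions ./a.out` (exact event names can vary between perf versions) reports counts in core clock cycles rather than in TSC reference cycles.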

See also: [How to get the CPU cycle count in x86_64 from C++?](https://stackoverflow.com/a/51907627).

  • Re: using `perf` to measure in core clock cycles in the first place: I often use `perf stat` on a program that has my workload in a big repeat loop so it totally dominates the startup overhead when running for 100ms to 1 second, or longer if you want more significant figures of accuracy. But if your loop bottlenecks in a way that leads to a whole number of cycles per iteration, you don't really need to bother. (Ideally statically linked, easy when writing in asm, or in C write a `_start` that exits with asm.) Or the more advanced option is to set things up so you can `rdpmc` from user-space. – Peter Cordes Jan 18 '20 at 04:19
  • @PeterCordes why don't you need to bother in that case? Because the numbers always come out like `a,aaa,000,xxx` due to the integer number of cycles in the payload, and you can just look at the `a,aaa` part? – BeeOnRope Jan 23 '20 at 20:31
  • @BeeOnRope: yeah, exactly. If you can assume that the non-integer part of cycles/iter is just noise, even a 10ms timing run is ok. (If you aren't doing memory stuff that needs to page-fault many pages the first time through... In which case assuming an integer bottleneck would be a bad assumption in the first place. Memory access makes everything complicated.) – Peter Cordes Jan 24 '20 at 00:44