I have read chapter 5 of CSAPP 3e, and I want to test whether the optimization techniques described in the book work on my computer. I wrote the following program:

#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>  /* __rdtsc */

#define SIZE (1024)
int main(int argc, char* argv[]) {
  int sum = 0;
  int* array = malloc(sizeof(int) * SIZE);
  unsigned long long before = __rdtsc();
  for (int i = 0; i < SIZE; ++i) {
    sum += array[i];
  }
  unsigned long long after = __rdtsc();
  double cpe = (double)(after - before) / SIZE;
  printf("CPE is %f\n", cpe);
  printf("sum is %d\n", sum);
  return 0;
}

It reports a CPE of around 1.00.

I transformed the program using the 4x4 loop unrolling technique (unrolling by 4 with 4 separate accumulators), which leads to the following program:

#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>  /* __rdtsc */

#define SIZE (1024)
int main(int argc, char* argv[]) {
  int sum = 0;
  int* array = malloc(sizeof(int) * SIZE);

  int sum0 = 0;
  int sum1 = 0;
  int sum2 = 0;
  int sum3 = 0;
  /* 4x4 unrolling */
  unsigned long long before = __rdtsc();
  for (int i = 0; i < SIZE; i += 4) {
    sum0 += array[i];
    sum1 += array[i + 1];
    sum2 += array[i + 2];
    sum3 += array[i + 3];
  }
  unsigned long long after = __rdtsc();
  sum = sum0 + sum1 + sum2 + sum3;
  double cpe = (double)(after - before) / SIZE;
  printf("CPE is %f\n", cpe);
  printf("sum is %d\n", sum);
  return 0;
}

Note that I omit the code to handle the case where SIZE is not a multiple of 4. This program reports a CPE of around 0.80.

My program runs on an AMD 5950X. According to AMD's software optimization manual (https://developer.amd.com/resources/developer-guides-manuals/), the integer addition instruction has a latency of 1 cycle and a throughput of 4 instructions per cycle, and the load-store unit can execute three independent load operations per cycle. I therefore expected a CPE of about 0.33, and I do not know why the measured result is so much higher.
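(Spelling out the arithmetic: at three loads sustained per cycle, the 1024 element loads should take about 1024 / 3 ≈ 341 cycles, i.e. a CPE of roughly 341 / 1024 ≈ 0.33.)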

My compiler is gcc 12.2.0. All programs are compiled with the -Og flag.

I checked the assembly code of the optimized program, but found nothing helpful:

.L4:
        movslq  %r9d, %rcx
        addl    (%r8,%rcx,4), %r11d
        addl    4(%r8,%rcx,4), %r10d
        addl    8(%r8,%rcx,4), %ebx
        addl    12(%r8,%rcx,4), %esi
        addl    $4, %r9d
.L3:
        cmpl    $127, %r9d
        jle     .L4

I assumed that at least 3 of the 4 addl instructions would execute in parallel, but the measured result does not meet that expectation.

Sep Roland
drcxd
  • I would not only count the cycles per operation but also consider the instruction pipeline and cache hits or misses. Modern C compilers usually do a great job of optimization; I would expect hand-coded optimization to be worse than compiler-optimized code. – harper Jan 20 '23 at 11:48
  • Also fill your input, to avoid possible page faults, and check the results (and avoid unexpected optimizations). Note that SIMD instructions can do this much more efficiently. (By the way, it is sad that Zen is not supported by uiCA.) – Jérôme Richard Jan 20 '23 at 12:27
  • @JérômeRichard: Or just use stack space, initialized or not. (Perhaps with inline `asm("" : "=m"(array))` to tell the compiler it was written, without actually running any instructions to write it.) – Peter Cordes Jan 20 '23 at 12:44
  • Don't use `-Og`; that's "optimize for debugging experience" or some such. Use `-O3`. – Lundin Jan 20 '23 at 13:00
  • @Lundin: The OP isn't trying to benchmark C, they're trying to create a scalar asm microbenchmark loop that matches their C source without actually writing asm by hand for some reason. `-Og` sort of works for that, and would be fine if they'd used `size_t`, `ptrdiff_t`, or `uintptr_t` for their loop counter to avoid the `movslq` inside the loop. They don't want auto-vectorization with `paddd` to do four 32-bit adds in one instruction, they're trying to measure the CPU pipeline's throughput limits for scalar `add r32, r/m32`. – Peter Cordes Jan 20 '23 at 16:24
  • @Lundin: Of course, looking at the asm should be step 1 for this kind of microbenchmarking, not where you turn for an explanation after the timing wasn't what you expected. So what I described in my previous comment as a valid approach to using `-Og` or `-O1`, or `-O2 -fno-tree-vectorize`, might not be the mindset the OP was taking; they may have just been going to assume that the compiler had done what they expected if the numbers happened to work out... – Peter Cordes Jan 20 '23 at 16:26
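(A minimal sketch of the loop-counter change suggested in the comments above: with a 64-bit counter such as size_t, GCC has no need to sign-extend the index each iteration, so the movslq disappears from the loop body. This is a drop-in replacement for the timed loop in the question; it needs <stddef.h> for size_t.)

for (size_t i = 0; i < SIZE; i += 4) {
  sum0 += array[i];      /* 64-bit index: no sign extension needed */
  sum1 += array[i + 1];
  sum2 += array[i + 2];
  sum3 += array[i + 3];
}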

1 Answer


cmpl $127, %r9d is not a large iteration count: it is small compared to rdtsc overhead, the branch mispredict when you exit the loop, and the time the CPU takes to ramp up to max frequency.
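(You can get a rough feel for that measurement overhead by timing back-to-back __rdtsc calls — a minimal sketch; rdtsc is not a serializing instruction, so treat the result as a ballpark lower bound:)

#include <stdio.h>
#include <x86intrin.h>

int main(void) {
  /* The difference between two back-to-back reads approximates the
     overhead that gets added to every rdtsc-timed region. */
  unsigned long long t0 = __rdtsc();
  unsigned long long t1 = __rdtsc();
  printf("back-to-back rdtsc: %llu reference cycles\n", t1 - t0);
  return 0;
}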

Also, you want to measure core clock cycles, not TSC reference cycles. Put the loop in a static executable (for minimal startup overhead) and run it with perf stat to get core clocks for the whole process. (As in Can x86's MOV really be "free"? Why can't I reproduce this at all? or some perf experiments I've posted in other answers.)

See Idiomatic way of performance evaluation?

10M to 1000M total iterations is appropriate, since that's still under a second and we only want to measure steady-state behaviour, not cold-cache or cold-branch-predictor effects, or page faults. Interrupt overhead tends to be under 1% on an idle system. Use perf stat --all-user to count only user-space cycles and instructions.
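For example (a sketch; bench.c is a placeholder name for a file containing the benchmark loop):

gcc -static -Og bench.c -o bench
perf stat --all-user ./bench

perf stat prints cycles and instructions by default, so the CPE is just total core cycles divided by total elements processed.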

If you want to do it over an array (instead of just removing the pointer increment from the asm), do many passes over a small (16K) array so they all hit in L1d cache. Use a nested loop, or use an AND to wrap an index, as in the sketch below.
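A minimal sketch of the index-wrapping version (sizes and names are illustrative; SIZE must be a power of two for the mask trick to work, and the array is written first so the timed loop doesn't eat page faults or read uninitialized memory):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIZE 4096                /* 4096 ints = 16 KiB, fits comfortably in L1d */
#define ITERS 100000000ULL       /* ~100M element loads, still well under a second */

int main(void) {
  int* array = malloc(sizeof(int) * SIZE);
  for (int i = 0; i < SIZE; ++i)  /* touch every element before the timed region */
    array[i] = 1;
  int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
  for (uint64_t i = 0; i < ITERS; i += 4) {
    uint64_t j = i & (SIZE - 1);  /* wrap the index; valid because SIZE is a power of 2 */
    sum0 += array[j];
    sum1 += array[j + 1];
    sum2 += array[j + 2];
    sum3 += array[j + 3];
  }
  printf("sum is %d\n", sum0 + sum1 + sum2 + sum3);  /* keep the sums live */
  return 0;
}

Run the whole program under perf stat as above; with ~100M loads, startup and the cold first pass are negligible, and core cycles divided by ITERS gives the steady-state CPE directly.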

Doing that, yes you should be able to measure the 3/clock throughput of add mem, reg on Zen3 and later, even if you leave in the movslq overhead and crap like that from compiler -Og output.


When you're truly micro-benchmarking to find out stuff about throughput of one form of one instruction, it's usually easier to write asm by hand than to coax a compiler into emitting the loop you want. (As long as you know enough asm to avoid pitfalls, e.g. .balign 64 before the loop just for good measure, to hopefully avoid front-end bottlenecks.)


See also https://uops.info/ for how they measure; for any given test, you can click on the link to see the asm loop body for the experiments they ran, and the raw perf counter outputs for each variation on the test. (Although I have to admit I forget what MPERF and APERF mean for AMD CPUs; the results for Intel CPUs are more obvious.) e.g. https://uops.info/html-tp/ZEN3/ADD_R32_M32-Measurements.html is the Zen3 results, which includes a test of 4 or 8 independent add reg, [r14+const] instructions as the inner loop body.

They also tested with an indexed addressing mode. With "unroll_count=200 and no inner loop" they got identical results for MPERF / APERF / UOPS for 4 independent adds, with indexed vs. non-indexed addressing modes. (Their loops don't have a pointer increment.)

Peter Cordes
  • Thanks a lot. I just started learning about computer science, so your answer contains a lot of information for me. I'll take my time to learn all this new material. – drcxd Jan 22 '23 at 03:29
  • @drcxd: Yup, microbenchmarking is not easy, and it can take some time to understand all the possible pitfalls. Using a large measurement interval will factor out all the small overheads like measurement, and other startup overheads due to CPUs being "lazy" to save power, like not running at max clock frequency all the time. And especially the fact that the TSC runs at a fixed frequency, regardless of the CPU core clock, because a low-overhead high-precision time source was more useful for OSes than an actual cycle counter. (Which you can still get with RDPMC if kernel programs the counters.) – Peter Cordes Jan 22 '23 at 03:35
  • @drcxd: Decades ago, computers used to be a lot simpler. No power-saving clock changes, cache effects were a lot less pronounced, and throughput was lower compared to interrupt/exception overhead (so page faults were relatively less expensive, although the kernel would still have to zero a page for you). On computers without virtual memory, you wouldn't have even soft page faults, though. Jumping in now, there's a lot of complexity in real-world hardware, which is why intro courses simplify a lot of stuff, and/or work with an in-order MIPS pipeline from 1980. – Peter Cordes Jan 22 '23 at 15:49
  • The simplified model leaves me confused when things do not work the way the textbook describes. – drcxd Jan 25 '23 at 11:03