
I'm having a hard time interpreting Intel performance events reporting.

Consider the following simple program that mainly reads/writes memory:

#include <stdint.h>
#include <stdio.h>

volatile uint32_t a;
volatile uint32_t b;

int main() {
  printf("&a=%p\n&b=%p\n", &a, &b);
  for(size_t i = 0; i < 1000000000LL; i++) {
    a ^= (uint32_t) i;
    b += (uint32_t) i;
    b ^= a;
  }
  return 0;
}

I compile it with gcc -O2 and run under perf:

# gcc -g -O2 a.c
# perf stat -a ./a.out
&a=0x55a4bcf5f038
&b=0x55a4bcf5f034

 Performance counter stats for 'system wide':

         32,646.97 msec cpu-clock                 #   15.974 CPUs utilized
               374      context-switches          #    0.011 K/sec
                 1      cpu-migrations            #    0.000 K/sec
                 1      page-faults               #    0.000 K/sec
    10,176,974,023      cycles                    #    0.312 GHz
    13,010,322,410      instructions              #    1.28  insn per cycle
     1,002,214,919      branches                  #   30.699 M/sec
           123,960      branch-misses             #    0.01% of all branches

       2.043727462 seconds time elapsed
# perf record -a ./a.out
&a=0x5589cc1fd038
&b=0x5589cc1fd034
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.997 MB perf.data (9269 samples) ]
# perf annotate

The result of perf annotate (annotated for memory loads/stores by me):

Percent│      for(size_t i = 0; i < 1000000000LL; i ++) {
       │      xor    %eax,%eax
       │      nop
       │            a ^= (uint32_t) i;
       │28:   mov    a,%edx                             // 32-bit load
       │      xor    %eax,%edx
  9.74 │      mov    %edx,a                             // 32-bit store
       │            b += (uint32_t) i;
 12.12 │      mov    b,%edx                             // 32-bit load
  8.79 │      add    %eax,%edx
       │      for(size_t i = 0; i < 1000000000LL; i ++) {
       │      add    $0x1,%rax
       │            b += (uint32_t) i;
 18.69 │      mov    %edx,b                             // 32-bit store
       │            b ^= a;
  0.04 │      mov    a,%ecx                             // 32-bit load
 22.39 │      mov    b,%edx                             // 32-bit load
  8.92 │      xor    %ecx,%edx
 19.31 │      mov    %edx,b                             // 32-bit store
       │      for(size_t i = 0; i < 1000000000LL; i ++) {
       │      cmp    $0x3b9aca00,%rax
       │    ↑ jne    28
       │      }
       │      return 0;
       │    }
       │      xor    %eax,%eax
       │      add    $0x8,%rsp
       │    ← retq

My observations:

  • From the 1.28 insn per cycle I conclude that the program is mainly memory-bound.
  • a and b appear to be adjacent and located in the same cache line (a quick check follows below).
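
A quick check of the printed addresses (assuming the usual 64-byte cache lines):

&b = 0x55a4bcf5f034, &a = 0x55a4bcf5f038      // 4 bytes apart
0x55a4bcf5f034 & ~0x3f = 0x55a4bcf5f000
0x55a4bcf5f038 & ~0x3f = 0x55a4bcf5f000       // same 64-byte line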

My Question:

  • Shouldn't CPU time be more consistent across the various memory loads and stores?
  • Why is the CPU time of the 1st memory load (mov a,%edx) zero?
  • Why does the 3rd load, mov a,%ecx, get only 0.04%, while the load right next to it, mov b,%edx, gets 22.39%?
  • Why do some instructions take 0 time? The loop consists of 14 instructions, so surely each instruction should contribute some observable time?

Notes:

OS: Linux 4.19.0-amd64, CPU: Intel Core i9-9900K, 100% idle system (also tested on i7-7700, same result).

rustyx
  • Strange the code does not highlight – dreamcrash Jan 31 '21 at 21:25
  • Related: [Unexpectedly slow performance on moving from register to frequently accessed variable](https://stackoverflow.com/q/76773814) is a simpler case, just a store waiting for an L3-miss load+add, with the store getting all the "blame". – Peter Cordes Jul 26 '23 at 18:35

1 Answer


Not exactly "memory" bound, but bound on latency of store-forwarding. i9-9900K and i7-7700 have exactly the same microarchitecture for each core so that's not surprising :P https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake. (Except possibly for improvement in hardware mitigation of Meltdown, and possibly fixing the loop buffer (LSD).)

Remember that when a perf event counter overflows and triggers a sample, the out-of-order superscalar CPU has to choose exactly one of the in-flight instructions to "blame" for this cycles event. Often this is the oldest un-retired instruction in the ROB, or the one after. Be very suspicious of cycles event samples over very small scales.
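
If you want samples with less skid, a PEBS-based "precise" event modifier can help somewhat (exact support depends on the CPU and kernel), e.g. something like:

# perf record -e cycles:pp ./a.out

Even then, with a latency bottleneck like this one, the blame still tends to land on instructions waiting for results rather than on the loads themselves.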

Perf never blames a load that was slow to produce a result; it usually blames the instruction that was waiting for it (in this case an xor or add), and here sometimes the store consuming the result of that xor. These aren't cache-miss loads; store-forwarding latency is only about 3 to 5 cycles on Skylake (variable, and shorter if you don't try to reload too soon: Loop with function call faster than an empty loop), so you do have loads completing at about 2 per 3 to 5 cycles.

You have two dependency chains through memory:

  • The longer one involves two RMWs of b. It is twice as long as the other and will be the overall bottleneck for the loop; see the rough arithmetic after this list.
  • The other involves one RMW of a (plus an extra read of a each iteration, which can happen in parallel with the read that's part of the next a ^= i;).
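
A rough sanity check, taking ~5 cycles per store-forward as a ballpark figure (not an exact number for this core):

2 store-forwards through b per iteration ≈ 2 × 5 ≈ 10 cycles/iteration
measured: 10,176,974,023 cycles / 1,000,000,000 iterations ≈ 10.2 cycles/iteration

So the serialized reloads/stores of b account for essentially all of the loop's time.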

The dep chain for i only involves registers and can run far ahead of the others; it's no surprise that add $0x1,%rax has no counts. Its execution cost is totally hidden in the shadow of waiting for loads.

I'm a bit surprised there are significant counts for mov %edx,a. Perhaps it sometimes has to wait for the older store uops involving b to run on the CPU's single store-data port. (Uops are dispatched to execution ports oldest-ready first. How are x86 uops scheduled, exactly?)

Uops can't retire until all previous uops have executed, so it could just be getting some skew from the store at the bottom of the loop. Uops retire in groups of 4, so if the mov %edx,b does retire, the already-executed cmp/jcc, the mov load of a, and the xor %eax,%edx can retire with it. Those are not part of the dep chain that waits for b, so they're always going to be sitting in the ROB waiting to retire whenever the b store is ready to retire. (This is guesswork about how mov %edx,a could be getting counts, despite not being part of a real bottleneck.)

The store-address uops should all run far ahead of the loop because they don't have to wait for previous iterations: RIP-relative addressing (footnote 1) is ready right away. And they can run on port 7, or compete with loads for ports 2 or 3. Same for the loads: they can execute right away and detect what store they're waiting for, with the load buffer monitoring it and ready to report when the data becomes ready after the store-data uop does eventually run.

Presumably the front-end will eventually bottleneck on allocating load buffer entries, and that's what will limit how many uops can be in the back-end, not ROB or RS size.

Footnote 1: Your annotated output only shows a, not a(%rip), which is odd; it doesn't matter whether you somehow got it to use 32-bit absolute addressing, or whether it's just a disassembly quirk failing to show the RIP-relative form.
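
For comparison, a RIP-relative load of a would normally show up in disassembly as something like this (the displacement is invented for illustration):

mov    0x2ef6(%rip),%edx        # <a>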

Peter Cordes
  • Thanks. Still not fully understanding the timings. According to the stats it should be 10 cycles/iteration, there's practically no variation in branching or data dependency, so I would expect the time "step" to be a multiple of 10%. So then what could explain numbers like 12% and 18%? – rustyx Jan 26 '21 at 20:49
  • @rustyx: Retirement is bursty. When the oldest uop finally executes and retires, that allows more groups of 4 to retire in successive clock cycles if they were already executed and just sitting in the ROB waiting to retire. Note that the dep chain involving `b` (the critical path) is mostly toward the bottom of the loop, and most of the counts are in instructions that are part of it. Related: [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) – Peter Cordes Jan 26 '21 at 20:54
  • @rustyx The unit of percentages is samples, not cycles. 12.12% is the portion of all samples captured when the instruction pointer was pointing at the respective instruction. These percentages generally have no correlation with how much each instruction has impacted wall-clock execution time, which is obvious in this program (Peter has already explained that the critical path mostly consists of the two sequential store-to-load forwards involving variable `b`, which is basically why you're getting 10c/iter). Even if these percents were cycles, I don't know why you're expecting only to see 10%s. – Hadi Brais Jan 27 '21 at 07:13
  • @HadiBrais: To be fair, a slow instruction like `div`, or one that consumes the result of a cache-miss load, often will get a lot of counts for the cycles event. But anyway, on 2nd look at Rusty's comment, I think the expectation is that perf will chop one iteration (which takes 10 cycles) up into its component costs, giving round numbers like 20% and 10%. The reason it's not like that is that the sampling is averaged over many iterations of the same loop (about 10 billion cycles total), not just 10 cycles, and @rustyx the mechanism by which it samples is event counters overflowing, triggering recording of a sample. – Peter Cordes Jan 27 '21 at 07:40
  • @PeterCordes OK does that mean the reported percentages are inaccurate (e.g. due to "burstiness") or is there some variation between iterations? note: I get these percentages regardless of # iterations. Also tried IACA and llvm-mca to no avail, as they don't seem to support store-forwarding timings. – rustyx Jan 27 '21 at 14:03
  • @rustyx: Well they're "inaccurate" in some sense, or at least reflect the mechanics of how the CPU chooses to blame a specific instruction when a cycles-based perf event fires. But not because of burstiness, instead because you're sampling across billions or millions of iterations. Not one iteration: if you even tried to do that, interrupt overhead would totally distort everything and you wouldn't be measuring what really happens when out-of-order execution has its way with your loop. (Events like `instructions` or `mem_trans_retired.load_latency_gt_4` have a specific insn. See `perf list`). – Peter Cordes Jan 27 '21 at 18:27