Whiskey Lake i7-8565U

RESOURCE_STALLS.OTHER does not seem to be well explained by the Intel docs:

Counts the number of cycles while execution was stalled due to other resource issues.

I ran the experiments on a memory copy of 16 MiB of randomly generated data, repeated in a loop of 6400 iterations.


Baseline:

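avx_memcpy_baseline:              ; loop overhead only: no loads, no stores
    shr rdx, 0x3                  ; rdx = size in bytes (assumed) -> count of 8-byte units
    xor rcx, rcx                  ; rcx = index in 8-byte units
avx_memcpy_baseline_loop:
    add rcx, 0x08                 ; advance by 8 units = 64 bytes per iteration
    cmp rdx, rcx
    ja avx_memcpy_baseline_loop   ; loop while rcx < rdx (unsigned)
    ret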

Baseline counters:

   823 292 269      resource_stalls.any
       181 045      r02a2 #LOAD
   831 370 403      r04a2 #RS_FULL
        49 659      resource_stalls.sb
       130 100      r10a2 #ROB_FULL
        63 386      r20a2 #FPCW
     2 151 516      r40a2 #MXCSR
         4 222      r80a2 #OTHER

WB Stores:

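avx_memcpy_forward_llss:                    ; rdi = dst, rsi = src, rdx = size in bytes (assumed)
    shr rdx, 0x3                            ; byte count -> count of 8-byte units
    xor rcx, rcx                            ; rcx = index in 8-byte units
avx_memcpy_forward_loop_llss:
    vmovdqa ymm0, [rsi + 8*rcx]             ; two aligned 32-byte loads
    vmovdqa ymm1, [rsi + 8*rcx + 0x20]
    vmovdqa [rdi + rcx*8], ymm0             ; two aligned 32-byte regular stores
    vmovdqa [rdi + rcx*8 + 0x20], ymm1
    add rcx, 0x08                           ; 8 units = 64 bytes copied per iteration
    cmp rdx, rcx
    ja avx_memcpy_forward_loop_llss         ; loop while rcx < rdx (unsigned)
    ret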

WB Stores counters:

27 089 245 473      resource_stalls.any
     4 873 836      r02a2 #LOAD
    14 099 696      r04a2 #RS_FULL
24 130 341 296      resource_stalls.sb
     5 790 969      r10a2 #ROB_FULL
       375 032      r20a2 #FPCW
     3 395 592      r40a2 #MXCSR
 4 899 892 032      r80a2 #resource_stalls.other, 18% of RESOURCE_STALLS.ANY

NT Stores:

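avx_nt_memcpy_forward_llss:                 ; rdi = dst, rsi = src, rdx = size in bytes (assumed)
    shr rdx, 0x3                            ; byte count -> count of 8-byte units
    xor rcx, rcx                            ; rcx = index in 8-byte units
avx_nt_memcpy_forward_loop_llss:
    vmovdqa ymm0, [rsi + 8*rcx]             ; two aligned 32-byte loads
    vmovdqa ymm1, [rsi + 8*rcx + 0x20]
    vmovntdq [rdi + rcx*8], ymm0            ; two aligned 32-byte non-temporal stores
    vmovntdq [rdi + rcx*8 + 0x20], ymm1
    add rcx, 0x08                           ; 8 units = 64 bytes copied per iteration
    cmp rdx, rcx
    ja avx_nt_memcpy_forward_loop_llss      ; loop while rcx < rdx (unsigned)
    ret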

NT Stores counters:

18 121 917 993      resource_stalls.any
     2 211 195      r02a2 #LOAD
     5 588 784      r04a2 #RS_FULL
12 061 475 989      resource_stalls.sb
     3 156 129      r10a2 #ROB_FULL
       165 967      r20a2 #FPCW
     2 152 595      r40a2 #MXCSR
 6 730 668 837      r80a2 #resource_stalls.other, 37% of RESOURCE_STALLS.ANY

It is especially noticeable with non-temporal stores, where it accounts for more than a third of all resource stalls, so I'm curious what RESOURCE_STALLS.OTHER might mean when profiling memory-bound routines on Skylake or later.
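As a quick sanity check on the proportions, the share of each large counter in `resource_stalls.any` can be recomputed from the raw counts posted above. This is only a small arithmetic sketch; the dictionary layout and the `share` helper are illustrative names, not part of any tool.

```python
# Counter values copied verbatim from the perf output in this question.
counters = {
    "WB stores": {"any": 27_089_245_473, "sb": 24_130_341_296, "other": 4_899_892_032},
    "NT stores": {"any": 18_121_917_993, "sb": 12_061_475_989, "other": 6_730_668_837},
}


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


for name, c in counters.items():
    print(f"{name}: SB {share(c['sb'], c['any']):.1f}%, "
          f"OTHER {share(c['other'], c['any']):.1f}% of resource_stalls.any")
```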

St.Antario
  • I assume it includes the RS or ROB being full, e.g. because of an old cache-miss load that can't retire until data arrives. And of course there are other microarchitectural resources like the [branch order buffer that enables fast recovery](https://stackoverflow.com/questions/50984007/what-exactly-happens-when-a-skylake-cpu-mispredicts-a-branch) from mispredicts without waiting for them to reach retirement. I think if it's full, a branch can't issue into the back-end. – Peter Cordes Feb 17 '20 at 22:20
  • @PeterCordes _I assume it includes the RS or ROB being full, e.g. because of an old cache-miss load that can't retire until data arrives._ I'm not sure that ROB and RS are included in the counter since they have a separate Umask. I added all the Umasks available in the Intel docs. – St.Antario Feb 18 '20 at 07:07
  • Oh, yeah your updated data does seem to indicate that's unlikely; the counts for ROB_FULL and RS_FULL are way too low to add up to that, so they don't account for most. (Assuming these events are really measuring what we think based on the names / docs). I'm surprised `perf` doesn't have named events for more of those different specific `resource_stalls`. I haven't used `ocperf.py` for a while, maybe it knows about those. – Peter Cordes Feb 18 '20 at 07:12

1 Answer

Intel has only documented two resource-related stall events on your processor, namely RESOURCE_STALLS.ANY and RESOURCE_STALLS.SB. The other events are documented on Nehalem/Westmere, but that doesn't mean they'll work accurately on Skylake. You'll have to validate them before trying to make sense of the event counts. At the very least, we have to check whether RESOURCE_STALLS.ANY is equal to the sum of RESOURCE_STALLS.SB and the other undocumented events. It looks like they do add up. (IIRC, about two years ago, I was in a situation where I had to validate some of these undocumented events on Haswell, but unfortunately I can't remember now which events.)
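The add-up check can be sketched with the counts from the question. The raw events `rXXa2` are read here as umasks of event 0xA2, and the `runs` layout and `coverage` helper are illustrative names. On this data the summed sub-events land within roughly 1% to 7% above RESOURCE_STALLS.ANY; one possible reason (an assumption, not something the docs confirm) is that more than one sub-event can be counted in the same stalled cycle.

```python
# Counts copied from the question's perf output.
runs = {
    "baseline": {
        "any": 823_292_269,
        "parts": {"LOAD": 181_045, "RS_FULL": 831_370_403, "SB": 49_659,
                  "ROB_FULL": 130_100, "FPCW": 63_386, "MXCSR": 2_151_516,
                  "OTHER": 4_222},
    },
    "WB stores": {
        "any": 27_089_245_473,
        "parts": {"LOAD": 4_873_836, "RS_FULL": 14_099_696, "SB": 24_130_341_296,
                  "ROB_FULL": 5_790_969, "FPCW": 375_032, "MXCSR": 3_395_592,
                  "OTHER": 4_899_892_032},
    },
    "NT stores": {
        "any": 18_121_917_993,
        "parts": {"LOAD": 2_211_195, "RS_FULL": 5_588_784, "SB": 12_061_475_989,
                  "ROB_FULL": 3_156_129, "FPCW": 165_967, "MXCSR": 2_152_595,
                  "OTHER": 6_730_668_837},
    },
}


def coverage(run):
    """Ratio of the summed sub-event counts to resource_stalls.any."""
    return sum(run["parts"].values()) / run["any"]


for name, run in runs.items():
    print(f"{name}: sum(sub-events) / any = {coverage(run):.3f}")
```

A ratio close to 1.0 suggests the sub-events partition the stall cycles reasonably well; a ratio far below 1.0 would point to a missing category, and one far above 1.0 to heavy double counting.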

The Intel manual describes RESOURCE_STALLS.ANY on Skylake as follows:

Counts resource-related stall cycles. Reasons for stalls can be as follows:
a. any u-arch structure got full (LB, SB, RS, ROB, BOB, LM, Physical Register Reclaim Table (PRRT), or Physical History Table (PHT) slots).
b. any u-arch structure got empty (like INT/SIMD FreeLists).
c. FPU control word (FPCW), MXCSR, and others.
This counts cycles that the pipeline back-end blocked uop delivery from the front-end.

This description provides a partial list of categories of resource-related stalls, rather than specific stall reasons. For example, the RS category includes many stall reasons that are specific to the RS. These categories exist in most of Intel's out-of-order microarchitectures, but the specific stall reasons can vary significantly between microarchitectures. The relative importance of each category in terms of its impact on performance also depends on the microarchitecture. This categorization is convenient from an analysis point of view.

Notice that many of the stall reasons for which performance events were documented on old microarchitectures are now simply mentioned under RESOURCE_STALLS.ANY, which means that they still exist even if the corresponding events are not documented.

Here is a brief description of each of these categories applicable to all out-of-order microarchitectures:

  • LB: The load buffer holds load uops and other uops that are executed on the load pipe. This category includes stall reasons specific to the LB. When the allocator cannot allocate an LB entry for any reason, an LB stall occurs.
  • SB: The store buffer holds STA, STD, and other uops that are executed on the store pipe. This category includes stall reasons specific to the SB. When the allocator cannot allocate an SB entry for any reason, an SB stall occurs.
  • RS: This holds all non-completed uops. The RS could be distributed or unified, depending on the microarchitecture. In both designs, RS-related stalls fall in this category.
  • ROB: This holds all uops to retire them in program order.
  • BOB: The branch order buffer associates the register state with each speculated branch (conditional or indirect) to enable fast misprediction recovery.
  • LM: The load matrix tracks register dependencies between any uop in the RS and all load uops in the RS (i.e. a uop takes as input a physical register that is the destination of a load uop that precedes in program order). The LM can become full before the LB when there are too many uops that are dependent on a small number of loads. If there are few dependencies but too many loads, then the LB may become full first.
  • PRRT: Each time a uop that modifies a physical register retires, the Physical Register Reclaim Table is updated to specify that the physical register that used to map the old version of the same architectural register can now be reclaimed (because now there is a new mapping for that register). This structure tracks allocated physical registers. If the allocator needs to allocate a physical register, there has to be a free entry in the PRRT. Otherwise, it stalls.
  • PHT: This tracks all current mappings of each architectural register to one or more physical registers. This structure is used to support fast branch recovery.
  • INT and SIMD Free Lists: There is logic that reclaims registers based on information from the PRRT. When a physical register is reclaimed, it's added to a structure called the free list, which effectively makes it free for allocation. There are two free lists, one for GP registers and the other for SIMD registers. These lists are used by the allocator to know which registers are free. Stalls related to the availability of physical registers fall in this category.
  • FPCW: An instruction that writes to the floating-point control word, such as FLDCW, may stall the pipeline until all earlier uops complete execution. The conditions depend on the microarchitecture and the FPCW bits that are modified (see Section 3.8.3 of the Intel optimization manual). These stalls are accounted here.
  • MXCSR: This is similar to the FPCW case. An instruction that writes to the MXCSR register, such as LDMXCSR, may stall the pipeline until all earlier uops complete execution. A microarchitecture may rename the MXCSR, but if not then it has to finish older math instructions before changing the rounding mode, for example.
  • Others: There are many other stall reasons that don't fall in any of the previous categories. Intel has decided to not mention them.

The event you call RESOURCE_STALLS.OTHER includes the following categories: BOB, LM, PRRT, PHT, free lists, and others. I think you're stalling on the LM. Try changing the loads to non-memory instructions that write the same destination registers and see whether RESOURCE_STALLS.OTHER becomes negligible.

Peter Cordes
Hadi Brais
  • Very useful answer Hadi! Can I ask where you learned all these CPU internals? Is it all scattered around the web (e.g. this site and the Intel forum) or is there more or less a single reference (like the optimization manuals, which I never read carefully)? – Margaret Bloom Feb 21 '20 at 15:00
  • It seems the description that you quoted from the Intel manual was added in a recent version. In the Oct 2016 version of the manual it was documented poorly. `RESOURCE_STALLS2` counts some of those events; `LM` is not present, unfortunately. I measured all of `RESOURCE_STALLS2.ALL_FL_EMPTY,BOB_FULL,OOO_RS_RC,ALL_PRF_CONTROL` and none of those counters had a significant impact. It seems you were right about `LM`: eliminating stores reduced `OTHER` stalls by several orders of magnitude: `12 197 524 972 resource_stalls.any` and `34 877 r80a2 #RESOURCE_STALLS.OTHER` – St.Antario Feb 22 '20 at 10:31
  • The question is where you found the description of the load matrix. Agner Fog does not seem to document it in his [microarchitecture guide](https://www.agner.org/optimize/microarchitecture.pdf). – St.Antario Feb 22 '20 at 10:35