
Does anybody know the meaning of stalled-cycles-frontend and stalled-cycles-backend in the perf stat output? I searched on the internet but did not find an answer. Thanks

$ sudo perf stat ls                     

Performance counter stats for 'ls':

      0.602144 task-clock                #    0.762 CPUs utilized          
             0 context-switches          #    0.000 K/sec                  
             0 CPU-migrations            #    0.000 K/sec                  
           236 page-faults               #    0.392 M/sec                  
        768956 cycles                    #    1.277 GHz                    
        962999 stalled-cycles-frontend   #  125.23% frontend cycles idle   
        634360 stalled-cycles-backend    #   82.50% backend  cycles idle
        890060 instructions              #    1.16  insns per cycle        
                                         #    1.08  stalled cycles per insn
        179378 branches                  #  297.899 M/sec                  
          9362 branch-misses             #    5.22% of all branches         [48.33%]

   0.000790562 seconds time elapsed
VAndrei
Dafan
  • I am not sure what the real question is here. Are you asking what the front-end and back-end of a CPU are? Please read this very [high level introduction](http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=98&Itemid=1&limit=1&limitstart=8). Does this answer your question? – Ali Mar 04 '14 at 12:23
  • I searched and search for a similar answer... This was the most helpful resource I found from Intel: https://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues – Jmoney38 Nov 05 '15 at 21:51
  • No, almost no one knows what those really mean. But referencing the manual (as in Manuel Selva's answer), combined with this post (which I don't fully understand yet), is the closest I've found: https://sites.utexas.edu/jdm4372/2014/06/04/counting-stall-cycles-on-the-intel-sandy-bridge-processor/ – jberryman Feb 15 '19 at 22:21

4 Answers


The theory:

Let's start from this: today's CPUs are superscalar, which means they can execute more than one instruction per cycle (IPC > 1). The latest Intel architectures can go up to 4 IPC (4 x86 instruction decoders). Let's not bring macro/micro fusion into the discussion, to keep things simple :).

Typically, workloads do not reach IPC=4 due to various resource contentions. This means that the CPU is wasting cycles (the number of instructions is fixed by the software, and the CPU has to execute them in as few cycles as possible).
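You can check the IPC you actually achieve directly with perf stat, which prints the insns-per-cycle ratio itself (as in the output above); ./your_program below is just a placeholder for your workload:

$ perf stat -e cycles,instructions ./your_program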

We can divide the total cycles spent by the CPU into 3 categories:

  1. Cycles where instructions get retired (useful work)
  2. Cycles spent stalled in the Back-End (wasted)
  3. Cycles spent stalled in the Front-End (wasted)

To get an IPC of 4, the number of retiring cycles has to be close to the total number of cycles. Keep in mind that at this stage all the micro-operations (uOps) retire from the pipeline and commit their results into registers / caches. At this stage you can have even more than 4 uOps retiring, because this number is given by the number of execution ports. If only 25% of the cycles retire 4 uOps, you will have an overall IPC of 1.
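For example, on a 4-wide machine:

IPC = 0.25 * 4 + 0.75 * 0 = 1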

The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or for long-latency instructions to finish (e.g. transcendentals: sqrt, reciprocals, divisions, etc.).

The cycles stalled in the front-end are a waste because it means the front-end does not feed the back-end with micro-operations. This can mean that you have misses in the instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually exhibits this behavior.

Another stall reason is a branch prediction miss. That is called bad speculation. In that case uOps are issued but they are discarded because the branch predictor (BP) predicted wrong.

The implementation in profilers:

How do you interpret the BE and FE stalled cycles?

Different profilers take different approaches to these metrics. In vTune, categories 1 to 3 add up to 100% of the cycles. That seems reasonable because either your CPU is stalled (no uOps are retiring) or it is performing useful work (uOps retiring). See more here: https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm

In perf this usually does not happen. That's a problem, because when you see 125% of cycles stalled in the front end, you don't know how to interpret it. You could link the >100% figure to the fact that there are 4 decoders, but if you continue that reasoning the IPC won't match.

Even worse, you don't know how big the problem is. 125% out of what? What do the #cycles mean then?
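From the sample output in the question, the percentage perf prints appears to be simply the stall-event count divided by the cycles count (my reading of the numbers, not a documented definition):

$ echo "scale=4; 962999 / 768956" | bc
1.2523

That matches the 125.23% above. One plausible reason the ratio can exceed 100% is that the underlying event can count core-wide conditions (note the Any=1 flag quoted in the other answers) while cycles is counted per thread, but there is no clear definition to lean on.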

Personally, I am a bit suspicious of perf's BE and FE stalled-cycle metrics and hope this will get fixed.

Probably we will get the final answer by debugging the code from here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c

VAndrei
  • What events are used in VTune as FE and BE? Manuel posted the perf events on Sandy Bridge. Sometimes the decoder can't decode 4 instructions (http://www.realworldtech.com/sandy-bridge/4/ - there are 3 simple decoders which can't decode complex instructions). – osgx Mar 15 '15 at 15:53
  • It's true there is also a complex decoder, but it may also be able to decode simple instructions. I updated my post with a link to vTune counters. It uses the same counters as perf, but I think vTune combines them differently. – VAndrei Mar 15 '15 at 16:42
  • Vtune uses https://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS" as "Frontend bound" and "1 - (Front-End Bound + Retiring + Bad Speculation)" as "Backend bound" where "Retiring = UOPS_RETIRED.RETIRE_SLOTS / SLOTS", "Bad Speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4*INT_MISC.RECOVERY_CYCLES) / SLOTS" and "SLOTS = 4*CPU_CLK_UNHALTED.THREAD" with 4 equal to "the machine pipeline width". – osgx Mar 15 '15 at 17:17
  • And for Sandy Bridge, Intel's Optimization Manual http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf gives the same in "B.3.2 Hierarchical Top-Down Performance Characterization Methodology": "%FE_Bound = 100 * (IDQ_UOPS_NOT_DELIVERED.CORE / N); %Bad_Speculation = 100 * ((UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / N); %Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / N); %BE_Bound = 100 * (1 - (FE_Bound + Retiring + Bad_Speculation)); N = 4*CPU_CLK_UNHALTED.THREAD" – osgx Mar 15 '15 at 17:27
  • @osgx Thanks. Now we know what the metrics mean in vTune and that they add up to 100%. The next question is why does perf compute them differently? Is it a bug or is there a meaning behind it? – VAndrei Mar 15 '15 at 21:06
  • VAndrei, one of the patches: http://lkml.iu.edu/hypermail/linux/kernel/1105.0/02486.html "[PATCH] perf events, x86: Add SandyBridge stalled-cycles-frontend/backend events" Lin Ming @May 06 2011 "**As commit 3011203 says, these are only approximations.**"; in a 2011 status report Eranian said: http://cscads.rice.edu/workshops/summer-2011/slides/performance-tools/Eranian-perf_events-CScADS-2011.pdf "Generic stall events ● two new generic PMU events: ○PERF_COUNT_HW_STALLED_CYCLES_FRONTEND ○PERF_COUNT_HW_STALLED_CYCLES_BACKEND ○ ***no clear definitions***". commit 8f62242246351b5a4bc0c1f00c0c7003edea128a – osgx Mar 15 '15 at 22:16
  • I don't know how inv=1 is implemented (computed) for "UOPS_ISSUED.ANY" (actual settings: `Cmask = 1, Inv = 1, Any = 1` - UOPS_ISSUED.CORE_STALL_CYCLES - "Cycles where no uops were issued to the OOO backend of the pipeline by either logical thread"). – osgx Mar 15 '15 at 22:30

To translate the generic events exported by perf into the raw events documented for your CPU, you can run:

more /sys/bus/event_source/devices/cpu/events/stalled-cycles-frontend 

It will show you something like

event=0x0e,umask=0x01,inv,cmask=0x01
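You can feed that encoding straight back to perf as a raw PMU event to check that it counts the same as the generic name (a sketch: the cpu PMU name and format terms are taken from the sysfs file above and may differ on other systems):

$ perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ ls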

According to the Intel SDM, volume 3B (I have a Core i5-2520):

UOPS_ISSUED.ANY:

  • Increments each cycle the # of Uops issued by the RAT to RS.
  • Set Cmask = 1, Inv = 1, Any= 1 to count stalled cycles of this core.

For the stalled-cycles-backend event, which translates to event=0xb1,umask=0x01 on my system, the same documentation says:

UOPS_DISPATCHED.THREAD:

  • Counts total number of uops to be dispatched per-thread each cycle
  • Set Cmask = 1, INV =1 to count stall cycles.
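As above, a sketch of counting this raw event directly, with the inv/cmask modifiers from the SDM note just quoted:

$ perf stat -e cycles -e cpu/event=0xb1,umask=0x01,inv,cmask=0x01/ ls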

Usually, stalled cycles are cycles where the processor is waiting for something (memory to be fetched after a load operation, for example) and doesn't have anything else to do. Moreover, the front-end part of the CPU is the piece of hardware responsible for fetching and decoding instructions (converting them to uOps), whereas the back-end part is responsible for effectively executing the uOps.

Manuel Selva
  • thanks for your reply. So what is the difference between stalled and idle? – Dafan Mar 08 '14 at 13:40
  • Stalled and idle are the same. The CPU is idle because it's stalled, as the instruction pipeline is not moving. – Milind Dumbare Mar 12 '14 at 22:28
  • @Milind, shouldn't there be a difference: stalled should be "we don't progress because the next stage doesn't allow it", and idle should be "there is nothing to process"? – Surt Nov 18 '14 at 14:56

A CPU cycle is “stalled” when the pipeline doesn't advance during it.

A processor pipeline is composed of many stages: the front-end is the group of stages responsible for the fetch and decode phases, while the back-end executes the instructions. There is a buffer between the front-end and the back-end, so when the former is stalled, the latter can still have some work to do.

Taken from http://paolobernardi.wordpress.com/2012/08/07/playing-around-with-perf/

Milind Dumbare

According to the author of these events, they are defined loosely and are approximated by the available CPU performance counters. As far as I know, perf doesn't support formulas to calculate a synthetic event from several hardware events, so it can't use the front-end/back-end stall-bound method from Intel's Optimization Manual (implemented in VTune): http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf "B.3.2 Hierarchical Top-Down Performance Characterization Methodology"

%FE_Bound = 100 * (IDQ_UOPS_NOT_DELIVERED.CORE / N);
%Bad_Speculation = 100 * ((UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / N);
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / N);
%BE_Bound = 100 * (1 - (FE_Bound + Retiring + Bad_Speculation));
N = 4 * CPU_CLK_UNHALTED.THREAD (for SandyBridge)
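As an illustration only, here is a minimal external sketch in that spirit. It assumes a perf build that knows these symbolic Intel event names (older builds need the raw codes from the SDM), the -x, CSV column layout varies between perf versions, and ./my_workload is a placeholder:

perf stat -x, -e cycles,uops_issued.any,uops_retired.retire_slots \
    -e idq_uops_not_delivered.core,int_misc.recovery_cycles \
    -- ./my_workload 2> counts.csv

awk -F, '
    { v[$3] = $1 }   # field 1 = count, field 3 = event name (layout may vary)
    END {
        N  = 4 * v["cycles"]          # SLOTS = 4 * CPU_CLK_UNHALTED.THREAD
        fe = v["idq_uops_not_delivered.core"] / N
        rt = v["uops_retired.retire_slots"] / N
        bs = (v["uops_issued.any"] - v["uops_retired.retire_slots"] + 4 * v["int_misc.recovery_cycles"]) / N
        printf "FE=%.1f%% BadSpec=%.1f%% Retiring=%.1f%% BE=%.1f%%\n", 100*fe, 100*bs, 100*rt, 100*(1 - fe - bs - rt)
    }' counts.csv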

For real use, the right formulas are implemented with external scripting in Andi Kleen's pmu-tools (toplev.py): https://github.com/andikleen/pmu-tools (source), http://halobates.de/blog/p/262 (description):

% toplev.py -d -l2 numademo  100M stream
...
perf stat --log-fd 4 -x, -e
{r3079,r19c,r10401c3,r100030d,rc5,r10e,cycles,r400019c,r2c2,instructions}
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles}
numademo 100M stream
...
BE      Backend Bound:                      72.03%
    This category reflects slots where no uops are being delivered due to a lack
    of required resources for accepting more uops in the Backend of the pipeline.
.....
FE      Frontend Bound:                     54.07%
    This category reflects slots where the Frontend of the processor undersupplies
    its Backend.

The commit which introduced the stalled-cycles-frontend and stalled-cycles-backend events in place of the original universal stalled-cycles event:

http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=8f62242246351b5a4bc0c1f00c0c7003edea128a

author  Ingo Molnar <mingo@el...>   2011-04-29 11:19:47 (GMT)
committer   Ingo Molnar <mingo@el...>   2011-04-29 12:23:58 (GMT)
commit  8f62242246351b5a4bc0c1f00c0c7003edea128a (patch)
tree    9021c99956e0f9dc64655aaa4309c0f0fdb055c9
parent  ede70290046043b2638204cab55e26ea1d0c6cd9 (diff)

perf events: Add generic front-end and back-end stalled cycle event definitions

Add two generic hardware events: front-end and back-end stalled cycles.

These events measure conditions when the CPU is executing code but its capabilities are not fully utilized. Understanding such situations and analyzing them is an important sub-task of code optimization workflows.

Both events limit performance: most front end stalls tend to be caused by branch misprediction or instruction fetch cache misses, backend stalls can be caused by various resource shortages or inefficient instruction scheduling.

Front-end stalls are the more important ones: code cannot run fast if the instruction stream is not being kept up.

An over-utilized back-end can cause front-end stalls and thus has to be kept an eye on as well.

The exact composition is very program logic and instruction mix dependent.

We use the terms 'stall', 'front-end' and 'back-end' loosely and try to use the best available events from specific CPUs that approximate these concepts.

Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar

    /* Install the stalled-cycles event: UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 */
-       intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES] = 0x1803fb1;
+       intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x1803fb1;

-   PERF_COUNT_HW_STALLED_CYCLES        = 7,
+   PERF_COUNT_HW_STALLED_CYCLES_FRONTEND   = 7,
+   PERF_COUNT_HW_STALLED_CYCLES_BACKEND    = 8,
osgx
  • So, in the end, is it an error in perf? Because FE + BE + ? don't add up to a known theoretical value, it's hard to assess how big the problem in your code is. When you see 75% FE stalling, that needs to be compared with something. Saying that 75% out of 100% of the cycles are stalled in the FE or BE has a whole different meaning and value. From what I see, even toplev.py has the same issue. If this is not an issue, how do we interpret the metrics? What makes the metrics high or low? – VAndrei Mar 17 '15 at 18:52
  • VAndrei, do you have a short and reproducible example for SandyBridge (+-1 generation), both for `perf stat` with FE > 100% and for toplev.py? I just started with short simple loops and got 3G cycles for 3G instructions (1G are branches with a 0.00% miss rate) with 2G FE stalls (`perf stat`) and 1G BE stalls (IPC=1.00). I think one problem is to correctly define "stall" for a complex OOO core, and another is to correctly interpret `toplev.py` results. – osgx Mar 17 '15 at 20:18
  • The code I posted here: http://stackoverflow.com/questions/28961405/is-there-a-code-that-results-in-50-branch-prediction-miss should be front-end bound. There are a lot of branch misses in it, so that would generate FE stalls. Regarding BE bound, you need a workload that waits for data from RAM. Allocate a buffer half the size of your physical memory and use an LCG (like in my code) to do a read/modify/write operation at a random location in the buffer. That generates a small number of instructions besides the RMW transaction, and the core will stall in the BE waiting for data from RAM. – VAndrei Mar 18 '15 at 17:30
  • Generating FE-bound workloads is quite a challenge. Please try whether the branching microbenchmark works; if not, you need something more complex. An FE stall is generated by a high number of instruction cache misses. To get those, you need large code with far jumps through it, leading to multiple I$ misses. At this point I don't have an idea of how to make an FE-bound workload in a microbenchmark. – VAndrei Mar 18 '15 at 17:35
  • I think you would be interested in this link: http://stackoverflow.com/questions/1756825/how-can-i-do-a-cpu-cache-flush You can use some of the techniques discussed there to flush the I$ and therefore generate FE stalls. – VAndrei Mar 19 '15 at 14:14