When running toplev, from pmu-tools on a piece of software (compiled with gcc: gcc -g -O3) I get this output:
FE Frontend_Bound: 37.21 +- 0.00 % Slots
BAD Bad_Speculation: 23.62 +- 0.00 % Slots
BE Backend_Bound: 7.33 +- 0.00 % Slots below
RET Retiring: 31.82 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 26.55 +- 0.00 % Slots
FE Frontend_Bound.Frontend_Bandwidth: 10.62 +- 0.00 % Slots
BAD Bad_Speculation.Branch_Mispredicts: 23.72 +- 0.00 % Slots
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 1.59 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 5.73 +- 0.00 % Slots below
RET Retiring.Base: 31.54 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.28 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.70 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.62 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.04 +- 0.00 % Clocks_Estimated <==
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.57 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.76 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.36 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 26.79 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 6.53 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.03 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.37 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 2.46 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 0.22 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.01 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 28.53 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
This binary takes around 4.7 seconds to run.
If I add the following flag to gcc: -falign-loops=32, the binary takes now around 3.8 seconds to run, and this is the output from toplev:
FE Frontend_Bound: 17.47 +- 0.00 % Slots below
BAD Bad_Speculation: 28.55 +- 0.00 % Slots
BE Backend_Bound: 12.02 +- 0.00 % Slots
RET Retiring: 34.21 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 6.10 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Bandwidth: 11.31 +- 0.00 % Slots below
BAD Bad_Speculation.Branch_Mispredicts: 29.19 +- 0.00 % Slots <==
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 4.58 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 7.44 +- 0.00 % Slots below
RET Retiring.Base: 33.70 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.50 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.55 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.58 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.72 +- 0.00 % Clocks_Estimated below
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.17 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.40 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.68 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 42.01 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 7.60 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.04 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.70 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 0.71 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 1.85 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.02 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 17.38 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
By adding that flag, the Frontend Latency has improved (as we can see from the toplev output). I understand that by adding that flag, now the loops are aligned to 32 bytes, and the DSB is hit more frequently when running tight loops (the code spends its time mostly in a couple of small loops). However I don't understand why the metric Frontend_Bound.Frontend_Bandwidth.DSB has gone up (the description for that metric is: "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline"). I would have expected that metric to go down, as the use of the DSB is precisely what I'm improving by adding the gcc flag.
PS: when running toplev I've used --no-multiplex, to minimize errors caused by multiplexing. The target architecture is Broadwell, and the assembly of the loops is the following (Intel syntax):
606: eb 15 jmp 61d <main+0x7d>
608: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
60f: 00
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
624: 48 8d 0c 36 lea rcx,[rsi+rsi*1]
628: 48 81 f9 00 20 00 00 cmp rcx,0x2000
62f: 77 20 ja 651 <main+0xb1>
631: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
636: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
63d: 00 00 00
640: 41 c6 04 08 00 mov BYTE PTR [r8+rcx*1],0x0
645: 48 01 f1 add rcx,rsi
648: 48 81 f9 00 20 00 00 cmp rcx,0x2000
64f: 7e ef jle 640 <main+0xa0>