
I'm optimizing code. My old code uses an if statement and a goto on true and false. My new code looks up data in an array (which I thought might raise stalled-cycles-backend), then uses a goto on true and false to jump to different labels.

Branch misses dropped by 0.01% and total branches by 0.05%. However, stalled-cycles-frontend went from 0.52% to 0.79%, which makes it slower overall.

How do I figure out the problem? My current plan is to blindly change the function structure in hopes that frontend cycles go down. The only other thing I notice is that the old gotos both jumped backwards, while of the new gotos one jumps forward (close to the current line) and one jumps backwards (to a different location, closer than the two old ones).

Chandler Carruth has some good CppCon videos about benchmarking and optimizing. You might want to see his "Going Nowhere Faster" talk. – NathanOliver Nov 04 '22 at 16:30
  • @NathanOliver is that the mod benchmark that ended up being an improvement? That is a good one and I think I've seen it twice. I loaded it up and jumped to about the 19m mark, where I see the `-e` icache-load perf line. Trying that, I see overall loads (and misses) rise, so I wonder if changing my code made the function significantly larger despite not adding more lines. dcache loads went up slightly and don't seem to be the problem – Andrew Benor Nov 04 '22 at 16:40
  • @NathanOliver looks like that one isn't the mod talk. It's talking about clamping and unrolling a loop with a multiple. I forgot to mention that my code isn't looping. The old code jumps to the middle of the loop and the new code jumps to the start of the previous if statement or to the next if statement. @ 42m the talk suggests loop unrolling potentially makes it faster, which isn't happening in my case :( – Andrew Benor Nov 04 '22 at 17:11
  • https://agner.org/optimize/ has a microarchitecture PDF, describing possible bottlenecks in the front-end of Sandybridge-family CPUs (fetch from uop cache). On Skylake specifically, there are more potholes because microcode updates disabled the LSD, and also [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) – Peter Cordes Nov 04 '22 at 18:56
  • What CPU are you testing on? e.g. a model number like i7-6700k or the vendor brand string from /proc/cpuinfo would identify it in enough detail to be useful. Something like "i5" would not be useful at all. – Peter Cordes Nov 04 '22 at 20:38
  • 1
    @PeterCordes AMD Ryzen 5 3600. I'm halfway through the ASM guide (vol 2, "optimization guide for x86 platforms"). I don't remember what it said about the branch predictor. When I tackle this I may want to look at both versions of the assembly side by side. Judging by the icache loads I suspect changing the goto made the function larger and miss more. I had profile-guided optimization fail (as in same speed, or error out during linking). Maybe I should try getting that to work and making a build script or a list of instructions that I can reproduce. --- What section of his guide would you recommend me reading? – Andrew Benor Nov 04 '22 at 21:03
  • An extra taken branch (even if unconditional) could also reduce fetch bandwidth from the uop cache, if AMD's works like Intel's and a predicted-taken branch ends a uop cache "line". So yeah, branch layout by the compiler could be a factor, if the fast path has less locality and contiguous fetches. – Peter Cordes Nov 05 '22 at 01:05
  • 1
    @PeterCordes I'm not 100% sure because I don't have the build from hours ago, but it looks like it was completely fixed by using PGO. All I did was add `-fprofile-generate=XYZ` and follow it up with `-fprofile-use`, and it seems to have fixed everything. Using perf I see the icache misses are no longer terrible (.5 vs .14). I'm not 100% sure if that's the correct way to do profile-guided optimization. Also the two builds are different; the second time around I used `nostdlib` and a macro to use different code. I expected it to fail but it seems to work – Andrew Benor Nov 05 '22 at 02:17
  • Yeah, PGO would fix those branch-layout issues, laying things out so the fast paths have fewer taken branches which directly helps the front-end, as well as helping I-cache / uop-cache locality. Yes, profile-generate, run the program on some test inputs that exercise the common cases, then recompile with -fprofile-use. – Peter Cordes Nov 05 '22 at 02:23
  • 1
    @NathanOliver I spent the last hour writing a script so I can test this consistently. Using the perf -e events from the talk I saw that profile-guided optimization basically fixed my code, so my changes actually were faster rather than slower; the original build just generated code with a worse layout. icache misses dropped to 1/3rd (so 2/3rds less). It turned out I hadn't seen that talk, so it was a great recommendation; I must have thought I had. Because of it I knew how to measure the difference between the builds and I don't need to guess why it was faster/slower. icache problem solved by PGO – Andrew Benor Nov 05 '22 at 02:33
  • @PeterCordes I'm having weird behavior with PGO sometimes being faster and no PGO sometimes being faster (depending on how bad the misses are). Any ideas why that might be? The main reason I'm pinging you is I'm reading Agner vol 3 for the microarchitecture, skipping what I didn't read in vol 2 (and not reading vol 1). If that somehow isn't the microarchitecture PDF you suggested then lmk – Andrew Benor Nov 07 '22 at 02:46
  • Compilers aren't perfect, perhaps profile-guided optimization happens to lay something out in a way that creates a problem. Or of course if your actual input data is different from the "training" run(s), in a way that the pattern or likelihood of a branch being taken is different, then PGO would have been optimizing for the wrong thing. Yes I was talking about https://agner.org/optimize/microarchitecture.pdf as a way to understand what possible potholes to look for with profiling. Identifying exactly why some fairly complex asm runs faster or slower on very complex modern CPUs can be *hard* – Peter Cordes Nov 07 '22 at 03:10

0 Answers