
(Followup code-review question here, with more details of the context of this loop.)


Environment:

  • Windows 7 x64
  • VS 2017 community
  • Targeting x64 code on an Intel i7-7700K (Kaby Lake)

I don't write a lot of assembler code, and when I do, it's either short enough or simple enough that I don't have to worry much about squeezing the maximum amount of perf out of it. My more complex code is usually written in C and I let the compiler's optimizers worry about latency, code alignment, etc.

However in my current project, MSVC's optimizer does a remarkably poor job on the code in my critical path. So...

I haven't yet found a good tool that does either a static or runtime analysis of x64 assembler code with a view to removing stalls, improving latency, etc. All I've got is the VS profiler, which tells me (roughly) which instructions are taking the most time. And the clock on the wall that tells me whether the latest change has made things better or worse.

As an alternative, I've been slogging my way through Agner's docs in the hope of squeezing some more perf out of my code. The problem is that it's hard to understand any of his work until you understand all of it. But pieces of it make sense, and I'm trying to apply what I've learned.

With that in mind, here's the core of my innermost loop, which (not surprisingly) is where the VS profiler says my time is being spent:

nottop:

vpminub ymm2, ymm2, ymm3 ; reset out of range values
vpsubb  ymm2, ymm2, ymm0 ; take a step

top:
vptest  ymm2, ymm1       ; check for out of range values
jnz nottop

; Outer loop that does some math, does a "vpsubb ymm2, ymm2, ymm0",
; and eventually jumps back to top

Yup, this is pretty much a textbook example of a dependency chain: Each instruction in this tight little loop depends upon the results of the previous operation. This means there can be no parallelism, which means I'm not taking full advantage of the processor.

Inspired by Agner's "optimizing assembler" doc, I've come up with an approach that (hopefully) allows me to do 2 operations at a time, so I could have one pipeline updating ymm2 and another updating (say) ymm8.
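
For illustration, the data path of that two-chain version might look something like this (just a sketch: the duplicated work in ymm8 and the handling of the exit checks are hand-waved here, and the real change would need more care):

nottop2:
vpminub ymm2, ymm2, ymm3 ; chain 1: reset out of range values
vpminub ymm8, ymm8, ymm3 ; chain 2: same work on an independent register
vpsubb  ymm2, ymm2, ymm0 ; chain 1: take a step
vpsubb  ymm8, ymm8, ymm0 ; chain 2: take a step

top2:
vptest  ymm2, ymm1       ; check chain 1 for out of range values
jnz nottop2              ; (the matching check for ymm8 is omitted here)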

It's a non-trivial change though, so before I start ripping everything apart, I wonder if it's likely to help. Looking at Agner's "Instruction tables" for kaby lake (my target), I see that:

        uops each port    Latency
pminub  p01               1
psubb   p015              1
ptest   p0 p5             3

Given this, it looks like while one pipeline is using p0+p5 to do the vptest against ymm2, another can be utilizing p1 to do both vpminub and vpsubb on ymm8. Yeah, things are still going to get stacked up behind vptest, but it should help.

Or would it?

I'm currently running this code from 8 threads (yes, 8 threads really does give me better total throughput than 4, 5, 6, or 7). Given that my i7-7700K has 4 hyperthreaded cores, wouldn't the fact that there are 2 threads running on each core mean that I'm already maxing out the ports? Ports are "per core," not "per logical CPU," right?

So.

Based on my current understanding of Agner's work, it appears that there is no way to further optimize this code in its current form. If I want better perf, I'm going to need to come up with a different approach.

And yes, I'm sure that if I posted my whole asm routine here, someone could suggest an alternate approach. But the purpose of this question isn't to have someone write my code for me. I'm trying to see if I'm starting to understand how to think about optimizing asm code.

Is this (roughly) the right way to look at things? Am I missing a few pieces? Or is this flat-out wrong?

  • *"I haven't yet found a good tool that does either a static or runtime analysis of x64 assembler code with a view to removing stalls, improving latency, etc."* Meet [IACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/). The latest microarchitecture it supports is Skylake, but as far as I know, there is relatively little difference between SKL and KBL. (Unfortunately, this tool will become less useful in the future if Intel stops updating it. :-() – Cody Gray - on strike Jul 16 '17 at 11:09
  • @CodyGray Thanks for the introduction. I bumped into iaca after I posted this question (see my "partial answer" below). The problem with "relatively little difference" is that I have no idea where those differences might lie. Changes to SSE seem like a plausible area for improvements when going from "Generation 6" to "Generation 7." I'm not excited about the idea of trying to define where these differences might lie. That aside, iaca isn't telling me much that I hadn't gleaned from Agner's work. It just would have saved me the lookups for the instructions to see what ports they use. – David Wohlferd Jul 16 '17 at 11:22
  • Relatively little difference is perhaps an overstatement. As far as I know Kaby Lake is the same micro-architecture as Skylake and the timings are the same. Little difference would be something like a Haswell to Broadwell or Broadwell to Skylake jump, but Kaby Lake really is a "no op". – BeeOnRope Jul 17 '17 at 18:30
  • @DavidWohlferd - what are typical values of `ymm3` and `ymm0` on entry to the `nottop` loop? Do they change in the outer loop? – BeeOnRope Jul 17 '17 at 19:10
  • @BeeOnRope - ymm0 and ymm3 are both constants, loaded once at initialization. That said, the purpose of this question was to test my understanding of asm optimization. I've got some of the basics, but clearly I've got a long way to go. I'm hoping to post a runnable version of this code on codereview soon. – David Wohlferd Jul 17 '17 at 22:51
  • I guess there is a really good chance that the whole thing can be sped up a lot once the behavior of the outer loop is understood. – BeeOnRope Jul 17 '17 at 22:53

2 Answers


TL:DR: I think Hyperthreading should keep all your vector ALU ports busy with 2 threads per core.


vptest doesn't write either vector register, only flags. The next iteration doesn't have to wait for it, so its latency is mostly irrelevant.

Only jnz is dependent on vptest, and speculative execution + branch prediction hides the latency of control dependencies. vptest latency is relevant for how quickly a branch mispredict can be detected, but not for throughput in the correctly-predicted case.


Good point about hyperthreading. Interleaving two independent dep chains within a single thread can be helpful, but it's a lot harder to do correctly and efficiently.

Let's look at the instructions in your loop. Predicted-taken jnz will always run on p6, so we can discount it. (Unrolling could actually hurt: predicted-not-taken jnz can also run on p0 or p6.)

On a core by itself, your loop should run at 2 cycles per iteration, bottlenecked on latency. It's 5 fused-domain uops, so it takes 1.25 cycles to issue. (jnz can macro-fuse with test, but not with vptest.) With hyperthreading, the front-end is already a worse bottleneck than latency: each thread can issue 4 uops every other cycle, which is less than the 5 uops every other cycle that the dependency-chain bottleneck calls for.

(This is common for recent Intel CPUs, especially SKL/KBL: many uops have enough ports to choose from that sustaining 4 uops per clock is realistic, and SKL's improved uop-cache and decoder throughput helps avoid issue bubbles caused by front-end limitations rather than by the back-end filling up.)

Every time one thread stalls (e.g. for a branch mispredict), the front-end can catch up on the other thread and get lots of future iterations into the out-of-order core for it to chew through at one iter per 2 cycles. (Or less because of execution-port throughput limits, see below).


Execution-port throughput (unfused-domain):

Only 1 of every 5 uops runs on p6 (the jnz). It can't be a bottleneck because the front-end issue rate limits us to less than one branch issuing per clock while running this loop.

The other 4 vector ALU uops per iteration have to run on the 3 ports with vector execution units. The p01 and p015 uops have enough scheduling flexibility that no single port will be a bottleneck, so we can just look at total ALU throughput. That's 4 uops / iter for 3 ports, for a max average throughput for a physical core of one iter per 1.333 cycles.

For a single thread (no HT), this is not the most serious bottleneck. But with two hyperthreads, that's one iter per 2.6666 cycles.

Hyperthreading should saturate your execution units, with some front-end throughput to spare. Each thread should average one iteration per 2.666c, while the front-end can issue at one per 2.5c. Since latency only limits you to one per 2c, each thread can catch up after any delays on the critical path due to resource conflicts (e.g. a vptest uop stealing a cycle from one of the other two uops).
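
Putting the same numbers side by side, per iteration of the inner loop:

Per physical core (both hyperthreads combined):
  front-end:   5 fused-domain uops / 4 per clock  = 1.25c
  p0/p1/p5:    4 vector-ALU uops / 3 ports        = 1.333c
Per logical thread (two threads sharing a core):
  front-end:   5 uops / 2 per clock               = 2.5c
  latency:     vpminub + vpsubb loop-carried      = 2c
  p0/p1/p5:    1.333c x 2                         = 2.666c  <- the bottleneck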

If you can change the loop to check less often, or with fewer vector uops, that could be a win. But everything I'm thinking of takes more vector uops (e.g. vpand instead of vptest and then vpor a couple of those results together before checking, or vpxor to produce an all-zero vector exactly when vptest would set ZF). Maybe if there were a vector XNOR or something, but there isn't.
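
To illustrate the flavor of that idea in a two-accumulator version: a single vpor could let one test cover both chains (a sketch only; ymm8 as a second accumulator and ymm4 as scratch are hypothetical, and you'd no longer know which vector went out of range):

vpor    ymm4, ymm2, ymm8 ; merge both step-vectors
vptest  ymm4, ymm1       ; ZF=1 only if neither has an out-of-range byte
jnz nottop               ; at least one of them needs the vpminub fixup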


To check what's actually happening, you could use perf counters to profile your current code, and see what uop throughput you're getting for a whole core (not just each logical thread separately). Or profile one logical thread and see if it's saturating about half of p015.

  • "Interleaving two independent dep chains within a single ...harder to do correctly and efficiently." I wasn't looking forward to it, but I *think* I could have made it work. "Hyperthreading should keep all your vector ALU ports busy" - Aha! So despite my inexperience, I ~got this right. I'll need to read this again, but the net seems to be that while dependency chains should usually be avoided, in this case the hyperthreading is already maxing out the compute abilities of the core (as I suspected). If I can't think of anything else to try, I may post a bigger chunk of this over on codereview. – David Wohlferd Jul 17 '17 at 01:45
  • @DavidWohlferd: ping me here if you post at codereview. There are so few asm/perf questions that I don't usually follow it. re: *dependency chains should usually be avoided*, that's obviously impossible (using outputs of instructions is kinda the point :), but avoiding latency *bottlenecks* is ideal. Some things are inherently serial, and finding ways to overlap or interleave is cool. Especially when you also care about CPUs without HT. I mentioned interleaving in [my collatz conjecture optimization answer](https://stackoverflow.com/a/40355466/224132) – Peter Cordes Jul 17 '17 at 01:58
  • It seems the biggest thing you missed is that your critical-path loop-carried dep chain is only 2 cycles, not 5, because vptest->jnz forks off of the chain every iteration. So you already have a decent amount of ILP in a single thread. – Peter Cordes Jul 17 '17 at 01:58
  • "change the loop to check less often" - Can't. "or with fewer vector uops" - Ahh... The problem with `vpand` et al is (unlike `and`) they don't set flags for jnz. I'd still need some way to get from ymm down to gp :(. But a scan of SSE instructions reminded me about `vpmovmskb`. While I never said, ymm1 is just the upper bit of each byte. So `vpmovmskb eax, ymm2 ; test eax, eax` effectively does the same thing as `vptest` (see the sketch below this comment thread). The test & jnz fuse, so even with 1 'more' instruction, this saves me ~5% off my execution time. Unfortunately while you gave 2 useful answers, I can only give you 1 upvote. – David Wohlferd Jul 17 '17 at 06:15
  • @DavidWohlferd: ah, yeah if you'd said that I would have told you to go for `vpmovmskb` right away :P The real use-cases for `vptest` are surprisingly rare, since it's 2 uops and can't macro-fuse. It can be a win if you're using `cmov` or `setcc` instead of branching on flags, but otherwise it's only break even vs. `vpcmpeq` / `vpmovmskb` / `test+jcc`. And if you want to bit-scan or popcnt the mask after breaking out of the loop, you already have it in a reg. (Without AVX, though, `ptest` is non-destructive so can save a `movdqa`). – Peter Cordes Jul 17 '17 at 07:00
  • @DavidWohlferd It isn't clear to me that you can't make this faster by doing more `vpsubb` instructions in a row without either the `vpminub` or test stuff, or perhaps you can avoid the inner loop entirely by direct calculation. For example, it seems the inner loop exits when all values are < 127, after repeatedly subtracting the values in `ymm0`, right? The `vpminub` is basically to avoid underflow for those values which have already passed that boundary? Depending on the exact value of `ymm3` it seems that many values may anyway be clamped to `ymm3`? – BeeOnRope Jul 17 '17 at 20:57
  • You can, in principle, just calculate how many iterations the loop will take, directly: it's something like `max((ymm2 - 0x80) / ymm3)`, right? That is, each element implies a certain iteration count, and the total count is just the largest of these. Now you don't want to actually do a division, but since these are bytes, and especially given the possible domain of the values in `ymm2` and `ymm3`, you might be able to do this quickly, in which case you can just calculate the final value directly (though there is no SIMD byte-wise multiplication either, so more tricks would be needed). – BeeOnRope Jul 17 '17 at 21:01
  • Depending on underflow considerations, you could perhaps just subtract `N * ymm0` each time, cutting the iterations by N (e.g., N could be 4 or 8), and then when you detect your exit condition you "redo" the last 4 iterations to get the right value. – BeeOnRope Jul 17 '17 at 21:03
  • @BeeOnRope My first reaction was "that's not going to work." But that's just me being defensive about my work. I've tried to come up with ways to do direct calculation without success (obviously). Doesn't mean it can't be done tho. And while the number of iterations of the inner loop can vary between 1 and ~103, it is weighted heavily toward 1. "Re-doing" even with N=2 might not be a net win. But I'll need to try it to say for sure. If I can hack out all the 3rd party libraries and specific hw I'm using, I'll post this to codereview. I'll ping both you and Peter here when I've got it. – David Wohlferd Jul 17 '17 at 23:10
  • If the loop usually iterates 1 time, then direct calculation or doing more than 1 sub at a time is mostly going to be useless. It's weird, though, that the inner loop would show up as the bottleneck then, given that the outer loop presumably does more stuff and they are running about the same number of times. Perhaps the culprit is branch mispredictions on the inner loop exit. On Linux you'd use `perf` or `toplev.py` to drill down; on Windows I'm not sure what the best free tools are, but VTune is a great one, and has a free eval period. – BeeOnRope Jul 17 '17 at 23:16
  • If the loop often only runs once, a lot of the time spent may be on branch-mispredicts! (That's good for hyperthreading, BTW. Not perfect because some mis-speculated uops will actually run and take resources away from the other thread before the correct path is found and uops from it have to be issued. But resources that would otherwise have been idle while recovering from a mispredict all go to the other thread.) – Peter Cordes Jul 17 '17 at 23:22
  • @BeeOnRope I was unable to make the N=2 approach work. However, I have now [posted](https://codereview.stackexchange.com/q/169819/110050) the code if you want to have a go at it. – David Wohlferd Jul 21 '17 at 06:47
  • @PeterCordes - You said you were interested too ([link](https://codereview.stackexchange.com/q/169819/110050)). – David Wohlferd Jul 21 '17 at 06:48
  • Thanks @DavidWohlferd! To be clear, over on code review you are open to bigger algorithmic changes that still get the same result (i.e., identifying likely primes), right? If you want the exact algorithm with only tweaks to the inner loop, you aren't going to get much better than Peter's answer here. Also, while you do a good job of describing various low level details, it would have been awesome to have a one or two sentence summary of the actual `ProcessA` algorithm, e.g., "all likely primes in [start, end] are identified by ..." – BeeOnRope Jul 21 '17 at 17:39
  • @BeeOnRope "algorithmic changes" - While I was creating that sample for CR, I realized I had an idea for what to try next. AAR, I almost didn't post the question. But I realized that in order to decide if my "next idea" was better, I needed to make sure I'd optimized the current code as much as I can. So I guess I'm open to either. – David Wohlferd Jul 21 '17 at 19:41

A partial answer:

Intel provides a tool named Intel Architecture Code Analyzer (described here) that does static analysis of code, showing (kind of) what ports are in use in a section of asm code.
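
In use, you bracket the region of code to analyze with IACA's marker bytes and run the tool over the resulting binary. A rough MASM-style sketch (the marker byte sequences come from Intel's iacaMarks.h; builds containing the markers are for analysis only, since the markers clobber ebx):

mov ebx, 111             ; IACA start marker
db  064h, 067h, 090h

; ... code under analysis (e.g. the inner loop above) ...

mov ebx, 222             ; IACA end marker
db  064h, 067h, 090h

The object file is then fed to the analyzer with something like `iaca.exe -arch SKL mycode.obj` (the exact flags vary between versions).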

Unfortunately:

  • v2.3 doesn't include the essential (and only) header file. You can find this file in v2.2.
  • v2.2 includes the header, but omits a python script (pt.py) for analyzing the output. This file is also missing from v2.3 (I haven't found it anywhere yet).
  • One of the output formats from iaca is a .dot file, which is read by graphviz, but the Intel docs fail to describe which of the 38 executables in graphviz should be used to display the output.

But perhaps most importantly (for my needs):

  • v2.3 (currently the most recent release) supports Skylake, but not Kaby Lake.

Given how the implementation details change between processors, that makes all the output suspect. Dates in the pdf file suggest that v2.3 was released July 2017, meaning that I'm probably going to have to wait a while for the next release.

  • Hmm, that's odd that they omitted the header file from v2.3. Kind of sloppy. Although it is nice to see that there was a new version released this month. I had heard rumors somewhere that they were going to stop maintaining IACA, so hopefully those turned out to be false. As far as the output, I've never tried using the Python script. I just look at the text output directly in the file (or output it to the command prompt). It is quite easy to read, in my opinion. Far easier than me learning Python or graphviz first. – Cody Gray - on strike Jul 16 '17 at 11:20
  • And I wouldn't worry too much about Skylake vs. Kaby Lake. KBL is just a minor "optimization" step of SKL, really just a process shrink and a clock-speed bump. [Very little of the microarchitecture changed](https://techreport.com/review/30587/intel-kaby-lake-cpus-revealed). I think the output would be pretty trustworthy. And honestly, even if it isn't perfect, it would definitely help you to get your head around the code in a way that Agner's manuals don't (unless, like you said, you're already an expert). – Cody Gray - on strike Jul 16 '17 at 11:21
  • Ok, I'll spend some more time with them. There may be pearls there yet. – David Wohlferd Jul 16 '17 at 11:24
  • BTW, the python script is for processing the `-trace` output. But I may not be ready for that level of detail anyway. – David Wohlferd Jul 16 '17 at 11:27
  • "Very little of the microarchitecture changed" - Oh my heck: There are -fins- (big ones) in Kaby Lake! – David Wohlferd Jul 16 '17 at 11:38
  • AFAIK, Kaby has *no* microarchitectural improvements over SKL. Clock-for-clock identical to SKL. The performance gains are all from power improvements that allow it to turbo higher for more of the time. Also it has some updates in the iGPU, like hardware decode of 10-bit h.265. You're not missing anything by analyzing for SKL. This isn't the first time Intel has done this: Nehalem->Westmere was a die-shrink with zero changes to clock-for-clock performance, only to the silicon it's built on. – Peter Cordes Jul 17 '17 at 00:17
  • Thanks for posting that IACA has been updated! I wonder if they fixed any of the minor uop-count bugs for specific instructions in existing uarches... – Peter Cordes Jul 17 '17 at 00:20
  • `.dot` files can be converted using `dot -Tpng -o output.png input.dot` – ninjalj Jul 17 '17 at 11:01
  • @ninjalj - Ahh, thank you. It's annoying that they just expect you to "know" how to use this other (never heard of it) tool. One command line example (like yours) would have made a big difference. – David Wohlferd Jul 17 '17 at 23:17
  • I view `.dot` files on Linux with `xdot` like `xdot FILENAME`. Works fine for IACA files. – BeeOnRope Jul 21 '17 at 17:37