(Followup code-review question here, with more details of the context of this loop.)
Environment:
- Windows 7 x64
- VS 2017 community
- Targeting x64 code on Intel i7700k (kaby lake)
I don't write a lot of assembler code, and when I do, it's either short enough or simple enough that I don't have to worry much about squeezing the maximum amount of perf out of it. My more complex code is usually written in C and I let the compiler's optimizers worry about latency, code alignment, etc.
However in my current project, MSVC's optimizer does a remarkably poor job on the code in my critical path. So...
I haven't yet found a good tool that does either a static or runtime analysis of x64 assembler code with a view to removing stalls, improving latency, etc. All I've got is the VS profiler which tells me (roughly) which instructions are taking the most time. And the clock on the wall that tells me if the latest change has made things better or worse.
As an alternative, I've been slogging my way thru Agner's docs with the hope of squeezing some more perf out of my code. The problem is that it's hard to understand any of his work until you understand all of it. But pieces of it make sense, and I'm trying to apply what I have learned.
What that in mind, here's the core of my innermost loop which (not surprisingly) is where the VS profiler says my time is being spent:
nottop:
vpminub ymm2, ymm2, ymm3 ; reset out of range values
vpsubb ymm2, ymm2, ymm0 ; take a step
top:
vptest ymm2, ymm1 ; check for out of range values
jnz nottop
; Outer loop that does some math, does a "vpsubb ymm2, ymm2, ymm0",
; and eventually jumps back to top
Yup, this is pretty much a textbook example of a dependency chain: Each instruction in this tight little loop depends upon the results of the previous operation. This means there can be no parallelism, which means I'm not taking full advantage of the processor.
Inspired by Agner's "optimizing assembler" doc, I've come up with an approach that (hopefully) allows me to do 2 operations at a time, so I could have one pipeline updating ymm2 and another updating (say) ymm8.
It's a non-trivial change though, so before I start ripping everything apart, I wonder if it's likely to help. Looking at Agner's "Instruction tables" for kaby lake (my target), I see that:
uops
each
port Latency
pminub p01 1
psubb p015 1
ptest p0 p5 3
Given this, it looks like while one pipeline is using p0+p5 to do the vptest against ymm2, another can be utilizing p1 to do both vpminub and vpsubb on ymm8. Yeah, things are still going to get stacked up behind vptest, but it should help.
Or would it?
I'm currently running this code from 8 threads (Yes, 8 threads really does give me better total throughput than 4,5,6 or 7). Given that my i7700k has 4 hyperthreaded cores, wouldn't the fact that there are 2 threads running on each core mean that I'm already maxing out the ports? Ports are "per core," not "per logical cpu," right?
So.
Based on my current understanding of Agner's work, it appears that there is no way to further optimize this code in its current form. If I want better perf, I'm going to need to come up with a different approach.
And yes, I'm sure if I posted my whole asm routine here, someone could suggest an alternate approach. But the purpose of this Question isn't to have someone write my code for me. I'm trying to see if I'm starting to understand how to think about optimizing asm code.
Is this (roughly) the right way to look at things? Am I missing a few pieces? Or is this flat-out wrong?