In your main loop (while the remaining lengths of both input strings are >= 16), use `pcmpistri` (the implicit-length string version) if you know there are no 0 bytes in your data. `pcmpistri` is significantly faster and costs fewer uops on most CPUs, perhaps because it only has 3 inputs (including the immediate) instead of 5. (https://uops.info/)
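For a concrete shape of such a loop, here's a minimal strcmp-style sketch in C with the SSE4.2 intrinsics (my code, not from any library; it assumes every 16-byte load is known to be safe, per the page-crossing discussion below):

```c
#include <nmmintrin.h>   // SSE4.2 intrinsics
#include <stddef.h>

// imm8 0x18: unsigned bytes, "equal each", negated so set bits mark differences
#define STRCMP_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY)

int strcmp_sse42(const char *a, const char *b)
{
    for (size_t off = 0; ; off += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + off));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + off));
        if (_mm_cmpistrc(va, vb, STRCMP_MODE)) {                 // CF: a difference exists
            size_t i = off + _mm_cmpistri(va, vb, STRCMP_MODE);  // ECX: its index
            return (unsigned char)a[i] - (unsigned char)b[i];
        }
        if (_mm_cmpistrz(va, vb, STRCMP_MODE))   // ZF: hit vb's 0 terminator
            return 0;                            // equal, same length
    }
}
```

Note the repeated intrinsic calls with identical arguments: compilers fold them into one `pcmpistri` and branch on its FLAGS, a wart in the intrinsics design discussed further down.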
> Do I have to worry about 128-bit alignment on the fetch below?
Yes for `movdqa` of course, but surprisingly the SSE4.2 string instructions don't fault on misaligned memory operands! For the legacy SSE (non-VEX) encoding of all previous instructions (except unaligned moves like `movups` / `movdqu`), 16-byte memory operands must be aligned. Intel's manual notes: "additionally, this instruction does not cause #GP if the memory operand is not aligned to 16 Byte boundary".
Of course you still have to avoid crossing into an unmapped page, e.g. for a 5-byte string that starts 7 bytes before an unmapped page, a 16-byte memory operand will still page-fault. (Is it safe to read past the end of a buffer within the same page on x86 and x64?) I don't see any mention of fault-suppression for the "ignored" part of a memory source operand in Intel's manual, unlike with AVX-512 masked loads.
For explicit-length strings, this is easy: you know when you're definitely far from the end of the shorter string, so you can just special-case the last iteration. (And you want to do that anyway so you can use `pcmpistri` in the main loop.)
e.g. do an unaligned load that ends at the last byte of the string if it's at least 16 bytes long, or check `(p & 4095) <= (4096 - 16)` to avoid a page-crossing load when you're fetching near the end of a string.
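Both tail strategies together, as a C sketch (the function name and the copy fallback are mine):

```c
#include <emmintrin.h>   // SSE2
#include <stdint.h>
#include <string.h>

// Produce a vector covering the tail of a len-byte string at p, where
// off is the offset at which fewer than 16 bytes remain.
static __m128i load_tail(const char *p, size_t len, size_t off)
{
    if (len >= 16)   // unaligned load ending exactly at the last byte;
                     // re-reads some already-processed bytes, which is fine
        return _mm_loadu_si128((const __m128i *)(p + len - 16));

    // Short string: a full 16-byte load is safe iff it stays in this page.
    if (((uintptr_t)(p + off) & 4095) <= 4096 - 16)
        return _mm_loadu_si128((const __m128i *)(p + off));

    // Short string butting up against a page boundary: copy the real bytes.
    __m128i buf = _mm_setzero_si128();
    memcpy(&buf, p + off, len - off);
    return buf;
}
```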
So in practice, if both strings have the same relative alignment, you can just handle the unaligned starts of the strings, then get into a loop that uses aligned loads from both (so you can keep using `movdqa`). An aligned load can't page-split, and thus can't fault, as long as the vector contains at least one string byte.
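A sketch of that structure for a bytewise-equality check (explicit lengths, `n >= 16`, and identical `& 15` alignment of both pointers assumed; the function name is mine):

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

int same_alignment_equal(const char *a, const char *b, size_t n)
{
    // One unaligned vector covering the misaligned head (may overlap the body).
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
        return 0;

    size_t head = 16 - ((uintptr_t)a & 15);  // bytes already checked
    const char *end = a + n;
    a += head;  b += head;                   // both now 16-byte aligned
    while (a + 16 <= end) {
        va = _mm_load_si128((const __m128i *)a);   // movdqa: can't page-split
        vb = _mm_load_si128((const __m128i *)b);
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;
        a += 16;  b += 16;
    }
    // Unaligned tail vector ending at the last byte (overlaps the body).
    va = _mm_loadu_si128((const __m128i *)(end - 16));
    vb = _mm_loadu_si128((const __m128i *)(b + (end - a) - 16));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}
```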
Relative misalignment is harder.
For performance, note that SSE4.2 is only supported on Nehalem and newer, where `movdqu` is relatively efficient (as cheap as `movdqa` if the pointer happens to be aligned at runtime). I think AMD support is similar; not until Bulldozer, which has AVX and cheap unaligned loads. Cache-line splits still hurt some, so if you expect large strings to be common, it may be worth hurting the short-string case and/or the already-aligned case by doing some extra checking.
Maybe have a look at what glibc's SSE2 / AVX `memcmp` implementation does; it has the same problem of reading SIMD vectors from 2 arrays that might be misaligned wrt. each other. (Simple bytewise equality is faster with `pcmpeqb`, so it wouldn't use SSE4.2 string instructions, but the problem of which SIMD vectors to load is the same.)
> Does pcmpestri check for short strings?
Yes, that's the whole point of taking 2 input lengths (in RAX for XMM1, and RDX for XMM2). See Intel's asm manual entry for `pcmpestri`.
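For instance (a sketch; the function name is mine), explicit lengths of 5 and 3 make `pcmpestri` ignore any garbage in the upper bytes of the vectors, and the validity rules flag the first position past the shorter operand as a difference:

```c
#include <nmmintrin.h>

// Index of the first difference between the 5 valid bytes of a and the
// 3 valid bytes of b; at most 3 here, where b runs out first.
int short_example(__m128i a, __m128i b)
{
    return _mm_cmpestri(a, 5, b, 3,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH
                        | _SIDD_NEGATIVE_POLARITY);
}
```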
> Does pcmpestri count rax and rdx down by n per chunk or do I have to do it?
You have to do it if that's what you want; `pcmpestri` looks at the first RAX bytes/words of XMM1 (up to 16 / 8), and the first RDX bytes (words) of XMM2/mem (up to 16 / 8), and outputs to ECX and EFLAGS. That is all. Again, Intel's manual is pretty clear about this. (Although it's pretty complicated to understand the actual aggregation and compare options!)
If you wanted to use it in a loop, you could just leave those registers set to 16 and compute them properly for a peeled final iteration after the loop. Or you could decrement each of them by 16 every iteration; `pcmpestri` appears to be designed for doing that, setting ZF and/or SF if EDX and/or EAX are < 16 (or 8), respectively.
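The decrement pattern in C intrinsics (a memcmp-style sketch, names mine; it assumes the 16-byte loads can't fault, per the page-crossing discussion above):

```c
#include <nmmintrin.h>
#include <stddef.h>

#define DIFF_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY)

// Index of the first differing byte of two n-byte buffers, or -1 if equal.
// Feeding pcmpestri the remaining length each iteration means the final
// partial chunk needs no masking: bytes past the given lengths are ignored.
int first_diff(const char *a, const char *b, size_t n)
{
    ptrdiff_t rem = (ptrdiff_t)n;   // remaining length; may go <= 0
    size_t off = 0;
    while (rem > 0) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + off));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + off));
        if (_mm_cmpestrc(va, (int)rem, vb, (int)rem, DIFF_MODE))  // CF: mismatch
            return (int)(off + _mm_cmpestri(va, (int)rem, vb, (int)rem, DIFF_MODE));
        rem -= 16;
        off += 16;
    }
    return -1;
}
```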
See also https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a useful high-level picture of the processing steps the SSE4.2 string instructions do, so you can figure out how to design useful ways to use them, and some examples like implementing `strcmp` and `strlen`. Intel's detailed documentation in the SDM gets bogged down in details, making the big picture hard to take in.
(A good unrolled SSE2 implementation can beat SSE4.2 for those simple functions, but a simple problem makes a good example.)
> What info do I need to pass back to the hi-level lang code?
Ideally you'd have proper intrinsics, not just wrappers for inline asm.
It probably depends on what the high-level code wants to do with it, although for `pcmpestri` specifically, all the information is present in ECX (the integer result): with byte granularity and least-significant indexing, CF = (ECX != 16) (CF is cleared exactly when no bits of the intermediate result are set, in which case ECX = 16), and OF = (ECX == 0) (OF is the low bit of the intermediate result). If GDC has GCC6 flag-output syntax, it wouldn't hurt, I guess, unless it tricks the compiler into making worse code to receive those outputs.
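If it does, an inline-asm wrapper could hand the FLAGS results straight to the compiler. A sketch in GNU C (untested, and whether GDC accepts GCC's `=@cc` flag-output constraints is exactly the open question):

```c
#include <emmintrin.h>

// pcmpistri with imm8 0x18 (unsigned bytes, equal-each, negative polarity).
// Returns CF (difference found), writes ZF (0 byte in b) and the ECX index
// through pointers, letting the compiler branch on FLAGS directly.
static inline int cmpistri_cf(__m128i a, __m128i b, unsigned *idx, int *zf)
{
    int cf, z;
    unsigned index;
    __asm__("pcmpistri $0x18, %[b], %[a]"
            : "=c"(index), "=@ccc"(cf), "=@ccz"(z)
            : [a] "x"(a), [b] "xm"(b));
    *idx = index;
    *zf = z;
    return cf;
}
```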
If you are using inline-asm to basically create intrinsics for SSE4.2 string instructions, it might be worth looking at Intel's design for C intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/.
e.g. one for the ECX result, `int _mm_cmpestri (__m128i a, int la, __m128i b, int lb, const int mode);`, and one for each separate FLAG output bit, like `_mm_cmpestro`.
However, there are flaws in Intel's design. For example, with the implicit-length string version at least, I remember that the only way to get an integer result and have the compiler branch on FLAGS directly from the instruction was to use two different intrinsics with the same inputs, and depend on the compiler optimizing them together.
With inline asm, it's easy to describe multiple outputs and have unused ones be optimized away. But unfortunately C doesn't have syntax for multiple return values, and I guess Intel didn't want to have an intrinsic with a by-reference output arg as well as a return value.
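For example, the core of an SSE4.2 `strlen` (the all-zero first operand trick from the strchr.com article above; a sketch) needs both ZF and the ECX index, so it calls two intrinsics with identical arguments and leans on the compiler to emit a single `pcmpistri`:

```c
#include <nmmintrin.h>

// Index of the first 0 byte in chunk, or -1 if there is none.
// With an all-zero first operand, its implicit length is 0, so EQUAL_EACH's
// validity rules set result bits at and beyond chunk's terminator;
// ECX is then the terminator's index.
int find_terminator(__m128i chunk)
{
    const __m128i zeros = _mm_setzero_si128();
    if (_mm_cmpistrz(zeros, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH))
        return _mm_cmpistri(zeros, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
    return -1;   // no 0 byte in this 16-byte chunk
}
```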
> Is it slower to use `lea %[offset], [%[offset] - 16]` before the `ja`? (chosen as it doesn't set flags)
I'd do the `movdqa` load first, then the `add`, then `pcmpistri`. That keeps the `movdqa` addressing mode simpler and smaller, and lets the first iteration's load start executing 1 cycle earlier, without waiting for the latency of an `add` (if the index was on the critical path; it might not be if you started at 0).
Using an indexed addressing mode is probably not harmful here (a multi-uop instruction like `pcmpe/istri` probably can't micro-fuse a load anyway, and `movdqa` / `movdqu` don't care). But in other cases it can be worth it to unroll and use pointer increments instead: see Micro fusion and addressing modes.
It might be worth unrolling by 2. I'd suggest counting uops to see if the loop is just above a multiple of 4, and/or trying it on a couple of CPUs like Skylake and Zen.