In your main loop (while the remaining lengths of both input strings are >= 16), use `pcmpistri` (the implicit-length string version) if you know there are no 0 bytes in your data. `pcmpistri` is significantly faster and costs fewer uops on most CPUs, perhaps because it only has 3 inputs (including the immediate) instead of 5. (https://uops.info/)
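For a concrete shape of such a loop, here's a minimal strcmp-style sketch in C with the SSE4.2 intrinsics (my code, not from any library; it assumes every 16-byte load is known to be safe, per the page-crossing discussion below):

```c
#include <nmmintrin.h>   // SSE4.2 intrinsics
#include <stddef.h>

// imm8 0x18: unsigned bytes, "equal each", negated so set bits mark differences
#define STRCMP_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY)

int strcmp_sse42(const char *a, const char *b)
{
    for (size_t off = 0; ; off += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + off));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + off));
        if (_mm_cmpistrc(va, vb, STRCMP_MODE)) {                 // CF: a difference exists
            size_t i = off + _mm_cmpistri(va, vb, STRCMP_MODE);  // ECX: its index
            return (unsigned char)a[i] - (unsigned char)b[i];
        }
        if (_mm_cmpistrz(va, vb, STRCMP_MODE))   // ZF: hit vb's 0 terminator
            return 0;                            // equal, same length
    }
}
```

Note the repeated intrinsic calls with identical arguments: compilers fold them into one `pcmpistri` and branch on its FLAGS, a wart in the intrinsics design discussed further down.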
> Do I have to worry about 128-bit alignment on the fetch below?
Yes for `movdqa` of course, but surprisingly the SSE4.2 string instructions don't fault on misaligned memory operands! For the legacy SSE (non-VEX) encoding of all previous instructions (except unaligned moves like `movups` / `movdqu`), 16-byte memory operands must be aligned. Intel's manual notes: "additionally, this instruction does not cause #GP if the memory operand is not aligned to 16 Byte boundary".
Of course you still have to avoid crossing into an unmapped page, e.g. for a 5-byte string that starts 7 bytes before an unmapped page, a 16-byte memory operand will still page-fault. (Is it safe to read past the end of a buffer within the same page on x86 and x64?) I don't see any mention of fault-suppression for the "ignored" part of a memory source operand in Intel's manual, unlike with AVX-512 masked loads.
For explicit-length strings, this is easy: you know when you're definitely far from the end of the shorter string, so you can just special-case the last iteration. (And you want to do that anyway so you can use `pcmpistri` in the main loop.)
e.g. do an unaligned load that ends at the last byte of the string if it's at least 16 bytes long, or check `(p & 4095) <= (4096 - 16)` to avoid a page-crossing load when you're fetching near the end of a string.
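Both tail strategies together, as a C sketch (the function name and the copy fallback are mine):

```c
#include <emmintrin.h>   // SSE2
#include <stdint.h>
#include <string.h>

// Produce a vector covering the tail of a len-byte string at p, where
// off is the offset at which fewer than 16 bytes remain.
static __m128i load_tail(const char *p, size_t len, size_t off)
{
    if (len >= 16)   // unaligned load ending exactly at the last byte;
                     // re-reads some already-processed bytes, which is fine
        return _mm_loadu_si128((const __m128i *)(p + len - 16));

    // Short string: a full 16-byte load is safe iff it stays in this page.
    if (((uintptr_t)(p + off) & 4095) <= 4096 - 16)
        return _mm_loadu_si128((const __m128i *)(p + off));

    // Short string butting up against a page boundary: copy the real bytes.
    __m128i buf = _mm_setzero_si128();
    memcpy(&buf, p + off, len - off);
    return buf;
}
```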
So in practice, if both strings have the same relative alignment, you can just handle the unaligned starts of the strings, then get into a loop that uses aligned loads from both (so you can keep using `movdqa`). An aligned load can't page-split, and thus can't fault, as long as the vector contains at least one string byte.
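A sketch of that structure for a bytewise-equality check (explicit lengths, `n >= 16`, and identical `& 15` alignment of both pointers assumed; the function name is mine):

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

int same_alignment_equal(const char *a, const char *b, size_t n)
{
    // One unaligned vector covering the misaligned head (may overlap the body).
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
        return 0;

    size_t head = 16 - ((uintptr_t)a & 15);  // bytes already checked
    const char *end = a + n;
    a += head;  b += head;                   // both now 16-byte aligned
    while (a + 16 <= end) {
        va = _mm_load_si128((const __m128i *)a);   // movdqa: can't page-split
        vb = _mm_load_si128((const __m128i *)b);
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;
        a += 16;  b += 16;
    }
    // Unaligned tail vector ending at the last byte (overlaps the body).
    va = _mm_loadu_si128((const __m128i *)(end - 16));
    vb = _mm_loadu_si128((const __m128i *)(b + (end - a) - 16));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}
```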
Relative misalignment is harder.
For performance, note that SSE4.2 is only supported on Nehalem and newer, where `movdqu` is relatively efficient (as cheap as `movdqa` if the pointer happens to be aligned at runtime). I think AMD support is similar; not until Bulldozer, which has AVX and cheap unaligned loads. Cache-line splits still hurt some, so if you expect large strings to be common, it may be worth hurting the short-string case and/or the already-aligned case by doing some extra checking.
Maybe have a look at what glibc's SSE2 / AVX `memcmp` implementation does; it has the same problem of reading SIMD vectors from 2 arrays that might be misaligned wrt. each other. (Simple bytewise equality is faster with `pcmpeqb`, so it wouldn't use SSE4.2 string instructions, but the problem of which SIMD vectors to load is the same.)
> Does pcmpestri check for short strings?
Yes, that's the whole point of taking 2 input lengths (in RAX for XMM1, and RDX for XMM2). See Intel's asm manual entry for `pcmpestri`.
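For instance (a sketch; the function name is mine), explicit lengths of 5 and 3 make `pcmpestri` ignore any garbage in the upper bytes of the vectors, and the validity rules flag the first position past the shorter operand as a difference:

```c
#include <nmmintrin.h>

// Index of the first difference between the 5 valid bytes of a and the
// 3 valid bytes of b; at most 3 here, where b runs out first.
int short_example(__m128i a, __m128i b)
{
    return _mm_cmpestri(a, 5, b, 3,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH
                        | _SIDD_NEGATIVE_POLARITY);
}
```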
> Does pcmpestri count rax and rdx down by n per chunk or do I have to do it?
You have to do it if that's what you want; `pcmpestri` looks at the first RAX bytes/words of XMM1 (up to 16 / 8), and the first RDX bytes (words) of XMM2/mem (up to 16 / 8), and outputs to ECX and EFLAGS. That is all. Again, Intel's manual is pretty clear about this. (Although it's pretty complicated to understand the actual aggregation and compare options!)
If you wanted to use it in a loop, you could just leave those registers set to 16 and compute them properly for a peeled final iteration after the loop. Or you could decrement each of them by 16 every iteration; `pcmpestri` appears to be designed for doing that, setting ZF and/or SF if EDX and/or EAX are < 16 (or 8), respectively.
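The decrement pattern in C intrinsics (a memcmp-style sketch, names mine; it assumes the 16-byte loads can't fault, per the page-crossing discussion above):

```c
#include <nmmintrin.h>
#include <stddef.h>

#define DIFF_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY)

// Index of the first differing byte of two n-byte buffers, or -1 if equal.
// Feeding pcmpestri the remaining length each iteration means the final
// partial chunk needs no masking: bytes past the given lengths are ignored.
int first_diff(const char *a, const char *b, size_t n)
{
    ptrdiff_t rem = (ptrdiff_t)n;   // remaining length; may go <= 0
    size_t off = 0;
    while (rem > 0) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + off));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + off));
        if (_mm_cmpestrc(va, (int)rem, vb, (int)rem, DIFF_MODE))  // CF: mismatch
            return (int)(off + _mm_cmpestri(va, (int)rem, vb, (int)rem, DIFF_MODE));
        rem -= 16;
        off += 16;
    }
    return -1;
}
```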
See also https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a useful high-level picture of the processing steps the SSE4.2 string instructions do, so you can figure out how to design useful ways to use them, and some examples like implementing `strcmp` and `strlen`. Intel's detailed documentation in the SDM gets bogged down in details, making the big picture hard to take in.
(A good unrolled SSE2 implementation can beat SSE4.2 for those simple functions, but a simple problem makes a good example.)
> What info do I need to pass back to the hi-level lang code?
Ideally you'd have proper intrinsics, not just wrappers for inline asm.
It probably depends on what the high-level code wants to do with it, although for `pcmpestri` specifically, all the information is present in ECX (the integer result): with byte granularity and least-significant indexing, CF = (ECX != 16) (CF is cleared exactly when no bits of the intermediate result are set, in which case ECX = 16), and OF = (ECX == 0) (OF is the low bit of the intermediate result). If GDC has GCC6 flag-output syntax, it wouldn't hurt, I guess, unless it tricks the compiler into making worse code to receive those outputs.
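If it does, an inline-asm wrapper could hand the FLAGS results straight to the compiler. A sketch in GNU C (untested, and whether GDC accepts GCC's `=@cc` flag-output constraints is exactly the open question):

```c
#include <emmintrin.h>

// pcmpistri with imm8 0x18 (unsigned bytes, equal-each, negative polarity).
// Returns CF (difference found), writes ZF (0 byte in b) and the ECX index
// through pointers, letting the compiler branch on FLAGS directly.
static inline int cmpistri_cf(__m128i a, __m128i b, unsigned *idx, int *zf)
{
    int cf, z;
    unsigned index;
    __asm__("pcmpistri $0x18, %[b], %[a]"
            : "=c"(index), "=@ccc"(cf), "=@ccz"(z)
            : [a] "x"(a), [b] "xm"(b));
    *idx = index;
    *zf = z;
    return cf;
}
```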
If you are using inline-asm to basically create intrinsics for SSE4.2 string instructions, it might be worth looking at Intel's design for C intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/.
e.g. one for the ECX result, `int _mm_cmpestri (__m128i a, int la, __m128i b, int lb, const int mode);`, and one for each separate FLAG output bit, like `_mm_cmpestro`.
However, there are flaws in Intel's design. For example, with the implicit-length string version at least, I remember that the only way to get an integer result and have the compiler branch on FLAGS directly from the instruction was to use two different intrinsics with the same inputs, and depend on the compiler optimizing them together.
With inline asm, it's easy to describe multiple outputs and have unused ones be optimized away. But unfortunately C doesn't have syntax for multiple return values, and I guess Intel didn't want to have an intrinsic with a by-reference output arg as well as a return value.
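For example, the core of an SSE4.2 `strlen` (the all-zero first operand trick from the strchr.com article above; a sketch) needs both ZF and the ECX index, so it calls two intrinsics with identical arguments and leans on the compiler to emit a single `pcmpistri`:

```c
#include <nmmintrin.h>

// Index of the first 0 byte in chunk, or -1 if there is none.
// With an all-zero first operand, its implicit length is 0, so EQUAL_EACH's
// validity rules set result bits at and beyond chunk's terminator;
// ECX is then the terminator's index.
int find_terminator(__m128i chunk)
{
    const __m128i zeros = _mm_setzero_si128();
    if (_mm_cmpistrz(zeros, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH))
        return _mm_cmpistri(zeros, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
    return -1;   // no 0 byte in this 16-byte chunk
}
```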
> Is it slower to use `lea %[offset], [%[offset] - 16]` before the `ja`? (chosen as it doesn't set flags)
I'd do the `movdqa` load first, then the `add`, then `pcmpistri`. That keeps the `movdqa` addressing mode simpler and smaller, and lets the first iteration's load start executing 1 cycle earlier, without waiting for the latency of an `add` (if the index was on the critical path; it might not be if you started at 0).
Using an indexed addressing mode is probably not harmful here (a multi-uop instruction like `pcmpe/istri` probably can't micro-fuse a load anyway, and `movdqa` / `movdqu` don't care). But in other cases it can be worth it to unroll and use pointer increments instead: see Micro fusion and addressing modes.
It might be worth unrolling by 2. I'd suggest counting uops to see if the loop is just above a multiple of 4, and/or trying it on a couple of CPUs like Skylake and Zen.