
I need to do a lexicographic comparison of a small number of small unsigned integers. If there are (for example) 8 8-bit integers, the obvious approach is to byteswap them and do an ordinary integer compare in a GPR. If there are 2 32-bit integers, a 32-bit rotate and an ordinary compare will do the trick. What if there are 4 16-bit integers? Obviously with a vector register it is easy to shuffle them, but is there an efficient approach using only GPRs, either to reversing their order or to doing the compare without reversing it?
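
For concreteness, here is a rough sketch of the two easy cases described above, assuming the elements are packed little-endian in memory at `rdi` and `rsi` and that an ordinary unsigned compare of the reversed values is what's wanted (the labels and the trailing `seta` are only illustrative):

cmp8x8:                         ; 8 x 8-bit elements
        mov rax, [rdi]
        mov rdx, [rsi]
        bswap rax               ; element 0 becomes the most significant byte
        bswap rdx
        cmp rax, rdx            ; unsigned flags now give the lexicographic order
        seta al                 ; e.g. al = 1 if the first key compares greater
        ret

cmp2x32:                        ; 2 x 32-bit elements
        mov rax, [rdi]
        mov rdx, [rsi]
        ror rax, 32             ; rotate by 32: element 0 moves to the high half
        ror rdx, 32
        cmp rax, rdx
        seta al
        ret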

phuclv
Moonchild
  • I would suggest adding a (possibly slow) reference implementation of the functionality for which you seek a high-performance solution. This will allow easy checking of the functional correctness of any proposed solutions, as well as a performance base line. – njuffa May 15 '22 at 07:12
  • If you first reverse the bytes with `bswap`, you then just need to reverse each pair of adjacent bytes. You could do that with a shift right by 8 and mask off the odd bytes, shift left by 8 and mask off the even bytes, and then OR the results together. (A sketch of this appears after these comments.) – Nate Eldredge May 15 '22 at 07:51
  • Or switch to ARM64 and do it in two instructions: `rev x0, x0 ; rev16 x0, x0`. :) – Nate Eldredge May 15 '22 at 07:58
  • [CMPSW](https://faydoc.tripod.com/cpu/cmpsw.htm) is fairly efficient, if the operands are in memory. – TonyK May 15 '22 at 12:16
  • @TonyK: If you're willing to compare 1 `short` at a time, `cmp ax, dx` is more efficient than setting up pointers for `cmpsw`. Or `movzx eax, [rsi+2]` / `cmp ax, [rdi+2]` to do in 2 uops what `cmpsw` does in 5 on Skylake, 7 on Alder Lake-P. (For small fixed-length buffers, I'm assuming loop unrolling so the pointer-increment part of cmpsw isn't necessary. Or just use indexed addressing modes with movzx/cmp, at worst defeating micro-fusion or maybe macro-fusion of cmp/jcc so `cmp ax, [rdi+rcx]/jcc` could be 3 uops vs. at best 1. It's [2 on SKL](https://stackoverflow.com/a/56413946/224132)) – Peter Cordes May 15 '22 at 12:41
  • @TonyK: TL:DR: No, `cmpsw` isn't efficient on modern Intel or AMD, compared to doing it manually. `repe cmpsw` is not efficient either, with startup overhead but not having fast-strings microcode so it only goes 1 word at a time. The `repe` version may avoid branch misprediction about where the first difference is, but microcode startup overhead is almost as bad. – Peter Cordes May 15 '22 at 12:43
  • I'd guess 2x `movq xmm, reg` / 2x `pshuflw` to word-reverse / SSE4.2 `pcmpgtq` could be worth trying. Or SSE2 `pcmpeqw` / `pmovmskb` / `not ax` / `bsr` to find the first difference? – Peter Cordes May 15 '22 at 12:48
  • @PeterCordes: Yes, I meant `repe cmpsw`. Thank you for catching that. (But don't you have to say `movzx eax, word ptr [rsi+2]` etc?) – TonyK May 15 '22 at 12:50
  • Moonchild, where are these four 16-bit integers? In memory, or registers? – TonyK May 15 '22 at 12:52
  • @TonyK: Oh yes, MOVZX needs an explicit source size. Apparently I tend to forget that when writing SO comments and thinking just about performance, since both sizes of movzx perform the same. :P (https://uops.info/) – Peter Cordes May 15 '22 at 12:58
  • @PeterCordes: I wasn't thinking about performance, I was thinking about saving the OP some trouble if they try to use your suggestion. – TonyK May 15 '22 at 13:12
  • @TonyK: Yes, I know, thanks for catching my mistake. ecm had to correct me on the same omission in another comment within the past couple weeks, so I was just musing about how I managed to get sloppy. – Peter Cordes May 15 '22 at 13:14
  • They are in memory. One thing I was considering is maintaining a shortish buffer of pre-reversed strings, computed using simd; but it would be nice not to have to bother. – Moonchild May 15 '22 at 20:58
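
Following Nate Eldredge's `bswap`-plus-byte-swap suggestion above, a pure-GPR word reverse might look like the sketch below (untested; the label and the choice of mask register are mine, and it uses the same ABCD word notation as the answer that follows):

wswap_bswap:
        ;;  rdi = ABCD (words)
        bswap rdi               ; words now read DCBA, but the two bytes
                                ;  inside each word are swapped
        mov rcx, 0x00FF00FF00FF00FF ; mask selecting the low byte of each word
        mov rax, rdi
        shr rax, 8
        and rax, rcx            ; high byte of each word, moved down
        and rdi, rcx            ; low byte of each word ...
        shl rdi, 8              ; ... moved up
        or rax, rdi             ; rax = DCBA with byte order restored
        ret

This needs a 64-bit mask constant and slightly more work than the rotate-based version in the answer below, but it is straightforward to verify.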

1 Answer

For the reverse alone, here's my attempt:

wswap2:
        ;;  rdi = ABCD (words)
        mov rax, rdi            ; rax = ABCD (copy for the second chain)
        ror edi, 16             ; rdi = 00DC
        shl rdi, 32             ; rdi = DC00
        shr rax, 32             ; rax = 00AB
        ror eax, 16             ; rax = 00BA
        or rax, rdi             ; rax = DCBA
        ret

It's convenient to be able to use a 32-bit rotate to swap two adjacent words.

We've got two parallel dependency chains of two uops each, followed by one more uop to merge them.
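
To connect this back to the comparison in the question, here is a sketch (not part of the original answer) of reversing both keys this way and then doing a single unsigned compare, with a memcmp-style 1/0/-1 result tacked on for illustration:

lexcmp4x16:
        ;;  rdi = first key (ABCD), rsi = second key; both loaded little-endian
        mov rax, rdi            ; word-reverse the first key, as in wswap2
        ror edi, 16
        shl rdi, 32
        shr rax, 32
        ror eax, 16
        or rax, rdi             ; rax = DCBA
        mov rdx, rsi            ; same dance for the second key
        ror esi, 16
        shl rsi, 32
        shr rdx, 32
        ror edx, 16
        or rdx, rsi             ; rdx = second key, word-reversed
        xor ecx, ecx
        cmp rax, rdx            ; one unsigned compare decides the order
        seta cl                 ; cl = 1 if the first key is greater
        sbb ecx, 0              ; ecx = 1, 0, or -1 (memcmp-style)
        mov eax, ecx
        ret

The two reversals are independent dependency chains, so out-of-order execution can overlap them before the final compare.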

Nate Eldredge
  • You could save 1 byte of code size by using `mov eax,edi` and doing low-half stuff on RAX. Dep-chain lengths are equal so lack of mov-elimination would hurt equally, and you can still schedule the work on RDI right after the copy. Or with BMI2, `rorx eax, edi, 16` instead of `mov`/`ror`. (To use rorx as a copy-and-shift, `rorx rax, rdi, 32` to get the high half and let a 32-bit operand-size rotate zero-extend, clearing the high garbage.) – Peter Cordes May 15 '22 at 17:21
  • This is good for latency. If you're only doing this for one pair of input numbers before a `cmp` (8 total shift/rotate ops), ALU throughput bottlenecks probably won't be a factor. (2/clock shifts and rotates on modern Intel, vs. 4/clock on Zen). It's 6 uops for the front-end, so if that's more important than latency, the thing to compare against is probably `movq xmm0, rdi` / `pshuflw xmm0, xmm0, 0b00_01_10_11` / `movq rax, xmm0`. (With SSE4.2 `pcmpgtq` as an option instead of copying back to integer, if signed compare works.) On Intel, 2 of the 3 uops are for port 5 (shuffle and `movq xmm, r64`). – Peter Cordes May 15 '22 at 17:31
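
For comparison, the SSE word-reverse Peter Cordes mentions in the comment above might be written roughly like this (untested sketch; the shuffle immediate 0x1B reverses the order of the four low words):

wswap2_sse:
        ;;  rdi = ABCD (words), result in rax, as in the GPR version
        movq xmm0, rdi
        pshuflw xmm0, xmm0, 0x1B    ; 0b00_01_10_11: reverse the four low words
        movq rax, xmm0              ; rax = DCBA
        ret

As noted in the comment above, this is three uops, two of them needing port 5 on Intel, so it is mainly attractive when front-end throughput matters more than latency.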