
I need to do a lexicographic comparison of a small number of small unsigned integers. If there are (for example) 8 8-bit integers, the obvious approach is to byteswap them and do an ordinary integer compare in a GPR. If there are 2 32-bit integers, a 32-bit rotate and an ordinary compare will do the trick. What if there are 4 16-bit integers? Obviously with a vector register it is easy to shuffle them, but is there an efficient approach using only GPRs, either to reversing their order or to doing the compare without reversing it?
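
For concreteness, here is a rough sketch of the two easy cases described above, assuming the elements are packed little-endian in memory at `rdi` and `rsi` and that an ordinary unsigned compare of the reversed values is what's wanted (the labels and the trailing `seta` are only illustrative):

cmp8x8:                         ; 8 x 8-bit elements
        mov rax, [rdi]
        mov rdx, [rsi]
        bswap rax               ; element 0 becomes the most significant byte
        bswap rdx
        cmp rax, rdx            ; unsigned flags now give the lexicographic order
        seta al                 ; e.g. al = 1 if the first key compares greater
        ret

cmp2x32:                        ; 2 x 32-bit elements
        mov rax, [rdi]
        mov rdx, [rsi]
        ror rax, 32             ; rotate by 32: element 0 moves to the high half
        ror rdx, 32
        cmp rax, rdx
        seta al
        ret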

phuclv
Moonchild
  • I would suggest adding a (possibly slow) reference implementation of the functionality for which you seek a high-performance solution. This will allow easy checking of the functional correctness of any proposed solutions, as well as a performance base line. – njuffa May 15 '22 at 07:12
  • If you first reverse the bytes with `bswap`, you then just need to reverse each pair of adjacent bytes. You could do that with a shift right by 8 and mask off the odd bytes, shift left by 8 and mask off the even bytes, and then OR the results together. (A sketch of this appears after these comments.) – Nate Eldredge May 15 '22 at 07:51
  • Or switch to ARM64 and do it in two instructions: `rev x0, x0 ; rev16 x0, x0`. :) – Nate Eldredge May 15 '22 at 07:58
  • [CMPSW](https://faydoc.tripod.com/cpu/cmpsw.htm) is fairly efficient, if the operands are in memory. – TonyK May 15 '22 at 12:16
  • @TonyK: If you're willing to compare 1 `short` at a time, `cmp ax, dx` is more efficient than setting up pointers for `cmpsw`. Or `movzx eax, [rsi+2]` / `cmp ax, [rdi+2]` to do in 2 uops what `cmpsw` does in 5 on Skylake, 7 on Alder Lake-P. (For small fixed-length buffers, I'm assuming loop unrolling so the pointer-increment part of cmpsw isn't necessary. Or just use indexed addressing modes with movzx/cmp, at worst defeating micro-fusion or maybe macro-fusion of cmp/jcc so `cmp ax, [rdi+rcx]/jcc` could be 3 uops vs. at best 1. It's [2 on SKL](https://stackoverflow.com/a/56413946/224132)) – Peter Cordes May 15 '22 at 12:41
  • @TonyK: TL:DR: No, `cmpsw` isn't efficient on modern Intel or AMD, compared to doing it manually. `repe cmpsw` is not efficient either, with startup overhead but not having fast-strings microcode so it only goes 1 word at a time. The `repe` version may avoid branch misprediction about where the first difference is, but microcode startup overhead is almost as bad. – Peter Cordes May 15 '22 at 12:43
  • I'd guess 2x `movq xmm, reg` / 2x `pshuflw` to word-reverse / SSE4.2 `pcmpgtq` could be worth trying. Or SSE2 `pcmpeqw` / `pmovmskb` / `not ax` / `bsr` to find the first difference? – Peter Cordes May 15 '22 at 12:48
  • @PeterCordes: Yes, I meant `repe cmpsw`. Thank you for catching that. (But don't you have to say `movzx eax, word ptr [rsi+2]` etc?) – TonyK May 15 '22 at 12:50
  • Moonchild, where are these four 16-bit integers? In memory, or registers? – TonyK May 15 '22 at 12:52
  • @TonyK: Oh yes, MOVZX needs an explicit source size. Apparently I tend to forget that when writing SO comments and thinking just about performance, since both sizes of movzx perform the same. :P (https://uops.info/) – Peter Cordes May 15 '22 at 12:58
  • @PeterCordes: I wasn't thinking about performance, I was thinking about saving the OP some trouble if they try to use your suggestion. – TonyK May 15 '22 at 13:12
  • @TonyK: Yes, I know, thanks for catching my mistake. ecm had to correct me on the same omission in another comment within the past couple weeks, so I was just musing about how I managed to get sloppy. – Peter Cordes May 15 '22 at 13:14
  • They are in memory. One thing I was considering is maintaining a shortish buffer of pre-reversed strings, computed using simd; but it would be nice not to have to bother. – Moonchild May 15 '22 at 20:58
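
Following Nate Eldredge's `bswap`-plus-byte-swap suggestion above, a pure-GPR word reverse might look like the sketch below (untested; the label and the choice of mask register are mine, and it uses the same ABCD word notation as the answer that follows):

wswap_bswap:
        ;;  rdi = ABCD (words)
        bswap rdi               ; words now read DCBA, but the two bytes
                                ;  inside each word are swapped
        mov rcx, 0x00FF00FF00FF00FF ; mask selecting the low byte of each word
        mov rax, rdi
        shr rax, 8
        and rax, rcx            ; high byte of each word, moved down
        and rdi, rcx            ; low byte of each word ...
        shl rdi, 8              ; ... moved up
        or rax, rdi             ; rax = DCBA with byte order restored
        ret

This needs a 64-bit mask constant and slightly more work than the rotate-based version in the answer below, but it is straightforward to verify.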

1 Answer

For the reverse alone, here's my attempt:

wswap2:
        ;;  rdi = ABCD (words)
        mov rax, rdi            ; rax = ABCD (copy for the second chain)
        ror edi, 16             ; rdi = 00DC
        shl rdi, 32             ; rdi = DC00
        shr rax, 32             ; rax = 00AB
        ror eax, 16             ; rax = 00BA
        or rax, rdi             ; rax = DCBA
        ret

It's convenient to be able to use a 32-bit rotate to swap two adjacent words.

We've got two parallel dependency chains of two uops each, followed by one more uop to merge them.
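
To connect this back to the comparison in the question, here is a sketch (not part of the original answer) of reversing both keys this way and then doing a single unsigned compare, with a memcmp-style 1/0/-1 result tacked on for illustration:

lexcmp4x16:
        ;;  rdi = first key (ABCD), rsi = second key; both loaded little-endian
        mov rax, rdi            ; word-reverse the first key, as in wswap2
        ror edi, 16
        shl rdi, 32
        shr rax, 32
        ror eax, 16
        or rax, rdi             ; rax = DCBA
        mov rdx, rsi            ; same dance for the second key
        ror esi, 16
        shl rsi, 32
        shr rdx, 32
        ror edx, 16
        or rdx, rsi             ; rdx = second key, word-reversed
        xor ecx, ecx
        cmp rax, rdx            ; one unsigned compare decides the order
        seta cl                 ; cl = 1 if the first key is greater
        sbb ecx, 0              ; ecx = 1, 0, or -1 (memcmp-style)
        mov eax, ecx
        ret

The two reversals are independent dependency chains, so out-of-order execution can overlap them before the final compare.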

Nate Eldredge
  • You could save 1 byte of code size by using `mov eax,edi` and doing low-half stuff on RAX. Dep-chain lengths are equal so lack of mov-elimination would hurt equally, and you can still schedule the work on RDI right after the copy. Or with BMI2, `rorx eax, edi, 16` instead of `mov`/`ror`. (To use rorx as a copy-and-shift, `rorx rax, rdi, 32` to get the high half and let a 32-bit operand-size rotate zero-extend, clearing the high garbage.) – Peter Cordes May 15 '22 at 17:21
  • This is good for latency. If you're only doing this for one pair of input numbers before a `cmp` (8 total shift/rotate ops), ALU throughput bottlenecks probably won't be a factor. (2/clock shifts and rotates on modern Intel, vs. 4/clock on Zen). It's 6 uops for the front-end, so if that's more important than latency, the thing to compare against is probably `movq xmm0, rdi` / `pshuflw xmm0, xmm0, 0b00_01_10_11` / `movq rax, xmm0`. (With SSE4.2 `pcmpgtq` as an option instead of copying back to integer, if signed compare works.) On Intel, 2 of the 3 uops are for port 5 (shuffle and `movq xmm, r64`). – Peter Cordes May 15 '22 at 17:31
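
For comparison, the SSE word-reverse Peter Cordes mentions in the comment above might be written roughly like this (untested sketch; the shuffle immediate 0x1B reverses the order of the four low words):

wswap2_sse:
        ;;  rdi = ABCD (words), result in rax, as in the GPR version
        movq xmm0, rdi
        pshuflw xmm0, xmm0, 0x1B    ; 0b00_01_10_11: reverse the four low words
        movq rax, xmm0              ; rax = DCBA
        ret

As noted in the comment above, this is three uops, two of them needing port 5 on Intel, so it is mainly attractive when front-end throughput matters more than latency.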