I need to do a lexicographic comparison of a small number of small unsigned integers. If there are (for example) eight 8-bit integers, the obvious approach is to byte-swap them (`bswap`) and do an ordinary unsigned compare in a GPR. If there are two 32-bit integers, a rotate by 32 and an ordinary compare will do the trick. What if there are four 16-bit integers? With a vector register it is obviously easy to shuffle them, but is there an efficient approach, using only GPRs, either to reversing their order or to doing the compare without reversing it?
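
For concreteness, here is a minimal sketch of those two baselines (register choices are mine; it assumes each operand is stored first-element-first in memory and has been loaded with an ordinary little-endian load, so the first element ends up in the low byte or dword):

        ; eight 8-bit elements per operand, already loaded little-endian into rax / rdx
        bswap rax                   ; first element becomes most significant
        bswap rdx
        cmp   rax, rdx              ; unsigned: ja / jb / je give the lexicographic order

        ; two 32-bit elements per operand, loaded the same way into rax / rdx
        ror   rax, 32               ; swap the halves: first element becomes high
        ror   rdx, 32
        cmp   rax, rdx              ; unsigned compare, as above
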
-
I would suggest adding a (possibly slow) reference implementation of the functionality for which you seek a high-performance solution. This will allow easy checking of the functional correctness of any proposed solutions, as well as a performance base line. – njuffa May 15 '22 at 07:12
-
If you first reverse the bytes with `bswap`, you then just need to reverse each pair of adjacent bytes. You could do that with a shift right by 8 and mask off the odd bytes, shift left by 8 and mask off the even bytes, and then OR the results together. [A sketch of this appears after the comment thread.] – Nate Eldredge May 15 '22 at 07:51
-
Or switch to ARM64 and do it in two instructions: `rev x0, x0 ; rev16 x0, x0`. :) – Nate Eldredge May 15 '22 at 07:58
-
[CMPSW](https://faydoc.tripod.com/cpu/cmpsw.htm) is fairly efficient, if the operands are in memory. – TonyK May 15 '22 at 12:16
-
@TonyK: If you're willing to compare 1 `short` at a time, `cmp ax, dx` is more efficient than setting up pointers for `cmpsw`. Or `movzx eax, [rsi+2]` / `cmp ax, [rdi+2]` to do in 2 uops what `cmpsw` does in 5 on Skylake, 7 on Alder Lake-P. (For small fixed-length buffers, I'm assuming loop unrolling so the pointer-increment part of cmpsw isn't necessary. Or just use indexed addressing modes with movzx/cmp, at worst defeating micro-fusion or maybe macro-fusion of cmp/jcc so `cmp ax, [rdi+rcx]/jcc` could be 3 uops vs. at best 1. It's [2 on SKL](https://stackoverflow.com/a/56413946/224132)) [The word-at-a-time version is sketched after the comment thread.] – Peter Cordes May 15 '22 at 12:41
-
@TonyK: TL:DR: No, `cmpsw` isn't efficient on modern Intel or AMD, compared to doing it manually. `repe cmpsw` is not efficient either, with startup overhead but not having fast-strings microcode so it only goes 1 word at a time. The `repe` version may avoid branch misprediction about where the first difference is, but microcode startup overhead is almost as bad. – Peter Cordes May 15 '22 at 12:43
-
I'd guess 2x `movq xmm, reg` / 2x `pshuflw` to word-reverse / SSE4.2 `pcmpgtq` could be worth trying. Or SSE2 `pcmpeqw` / `pmovmskb` / `not ax` / `bsr` to find the first difference? [Both ideas are sketched further down the page.] – Peter Cordes May 15 '22 at 12:48
-
@PeterCordes: Yes, I meant `repe cmpsw`. Thank you for catching that. (But don't you have to say `movzx eax, word ptr[rsi+2]` etc?) – TonyK May 15 '22 at 12:50
-
Moonchild, where are these four 16-bit integers? In memory, or registers? – TonyK May 15 '22 at 12:52
-
@TonyK: Oh yes, MOVZX needs an explicit source size. Apparently I tend to forget that when writing SO comments and thinking just about performance, since both sizes of movzx perform the same. :P (https://uops.info/) – Peter Cordes May 15 '22 at 12:58
-
@PeterCordes: I wasn't thinking about performance, I was thinking about saving the OP some trouble if they try to use your suggestion. – TonyK May 15 '22 at 13:12
-
@TonyK: Yes, I know, thanks for catching my mistake. ecm had to correct me on the same omission in another comment within the past couple weeks, so I was just musing about how I managed to get sloppy. – Peter Cordes May 15 '22 at 13:14
-
They are in memory. One thing I was considering is maintaining a shortish buffer of pre-reversed strings, computed using simd; but it would be nice not to have to bother. – Moonchild May 15 '22 at 20:58
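
Picking up Nate Eldredge's `bswap`-plus-adjacent-byte-swap suggestion from the comments, a minimal sketch for the 4×16-bit case (the mask constant and register choice are mine; the four elements are assumed to already be in `rdi` with the first element in the low word):

        mov   rcx, 0x00FF00FF00FF00FF
        bswap rdi                   ; reverses the words, but also the two bytes inside each word
        mov   rax, rdi
        shr   rax, 8
        and   rax, rcx              ; high byte of each word, moved down
        and   rdi, rcx              ; low byte of each word ...
        shl   rdi, 8                ;   ... moved up
        or    rax, rdi              ; rax = words in reversed order, bytes inside each word restored
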
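The word-at-a-time compare suggested in the comments, fully unrolled for four elements, might look like the following sketch (label and offsets are mine; `rdi` and `rsi` are assumed to point at the two operands, and no reversal is needed):

        movzx eax, word [rdi]
        cmp   ax, [rsi]
        jne   diff
        movzx eax, word [rdi + 2]
        cmp   ax, [rsi + 2]
        jne   diff
        movzx eax, word [rdi + 4]
        cmp   ax, [rsi + 4]
        jne   diff
        movzx eax, word [rdi + 6]
        cmp   ax, [rsi + 6]
    diff:                           ; unsigned flags reflect the first differing element
        ; ja = greater, jb = less, je = equal
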
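And a sketch of the SSE2 `pcmpeqw` / `pmovmskb` idea from the comments: locate the first differing element, then compare only that word in a GPR. Label, offsets, and register choices are mine; both operands are assumed to be 8 contiguous bytes in memory:

        movq     xmm0, [rdi]
        movq     xmm1, [rsi]
        pcmpeqw  xmm0, xmm1
        pmovmskb eax, xmm0          ; low 8 bits: 1 where the loaded bytes compare equal
        xor      eax, 0xFFFF        ; low 8 bits now mark differing bytes; zero means equal
        jz       equal
        bsf      ecx, eax           ; byte index of the first difference
        and      ecx, -2            ; round down to the containing 16-bit element
        movzx    eax, word [rdi + rcx]
        movzx    edx, word [rsi + rcx]
        cmp      eax, edx           ; unsigned: ja / jb / je give the lexicographic result
    equal:                          ; on the equal path ZF is already set by the xor, so je still reads correctly

This avoids any reversal at the cost of a branch on whether a difference exists at all.
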
1 Answer
For the reverse alone, here's my attempt:

    wswap2:
        ;; rdi = ABCD (words)
        mov rax, rdi
        ror edi, 16     ; rdi = 00DC
        shl rdi, 32     ; rdi = DC00
        shr rax, 32     ; rax = 00AB
        ror eax, 16     ; rax = 00BA
        or  rax, rdi    ; rax = DCBA
        ret

It's convenient to be able to use a 32-bit rotate to swap two adjacent words.
We've got two parallel dependency chains of two uops each, followed by one more uop to merge them.
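
As a usage sketch (my framing, not part of the original answer): with both 4×16-bit operands loaded little-endian so the first element sits in the low word, reverse each with the routine above and do an unsigned compare. A call-based version, just to show how the pieces compose; in practice you would inline the six ALU ops per operand:

        ; rdi = first operand, rsi = second operand (little-endian 64-bit loads)
        call  wswap2            ; rax = first operand, word-reversed
        mov   rdx, rax          ; wswap2 only writes rax and rdi, so rdx survives the next call
        mov   rdi, rsi
        call  wswap2            ; rax = second operand, word-reversed
        cmp   rdx, rax          ; unsigned: ja / jb / je give the lexicographic result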

Nate Eldredge
-
You could save 1 byte of code size by using `mov eax,edi` and doing low-half stuff on RAX. Dep-chain lengths are equal so lack of mov-elimination would hurt equally, and you can still schedule the work on RDI right after the copy. Or with BMI2, `rorx eax, edi, 16` instead of `mov`/`ror`. (To use rorx as a copy-and-shift, `rorx rax, rdi, 32` to get the high half and let a 32-bit operand-size rotate zero-extend, clearing the high garbage.) – Peter Cordes May 15 '22 at 17:21
-
This is good for latency. If you're only doing this for one pair of input numbers before a `cmp` (8 total shift/rotate ops), ALU throughput bottlenecks probably won't be a factor. (2/clock shifts and rotates on modern Intel, vs. 4/clock on Zen). It's 6 uops for the front-end, so if that's more important than latency, the thing to compare against is probably `movq xmm0, rdi` / `pshuflw xmm0, xmm0, 0b00_01_10_11` / `movq rax, xmm0`. (With SSE4.2 `pcmpgtq` as an option instead of copying back to integer, if signed compare works.) On Intel, 2 of the 3 uops are for port 5 (shuffle and `movq xmm, r64`). – Peter Cordes May 15 '22 at 17:31
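
For reference, a rough sketch of that XMM round trip (register choices are mine; it assumes both 4×16-bit operands are already in `rdi` and `rsi` with the first element in the low word, and that finishing with an unsigned GPR compare is acceptable):

        ; reverse the four words of each operand in XMM, compare back in GPRs
        movq    xmm0, rdi
        movq    xmm1, rsi
        pshuflw xmm0, xmm0, 0x1b    ; 0b00_01_10_11: reverse the four low words
        pshuflw xmm1, xmm1, 0x1b
        movq    rax, xmm0
        movq    rdx, xmm1
        cmp     rax, rdx            ; unsigned: ja / jb / je give the lexicographic result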