Which architectures have modulo- / non-modulo-shifting?

Question

When you shift a value on x86 with a variable shift count (in CL), the shift count is taken modulo the bit size of the destination operand. Is there an architecture that doesn't apply this kind of modulo operation, i.e. where shifting by more bits than the operand holds makes the result zero (for an unsigned shift)?

Older NVIDIA GPU architectures had dedicated left shift and right shift instructions that clamp the shift count to 0...32. Newer NVIDIA GPU architectures provide a unified funnel shift instruction with modifiers that allows one to select either clamping of the shift amount (shift count 0...32) or wrapping (shift count 0...31), e.g. `shf.l.clamp.b32` and `shf.r.wrap.b32`. — njuffa, Nov 27 '21 at 09:32

score 1 · Answer 1 · answered Nov 04 '21 at 06:46

x86-64 scalar shifts mask the count to 5 bits (mod 32) even for 8- and 16-bit operand sizes, so only 32- and 64-bit shifts are truly modulo the operand size. (The 8086 didn't mask the count at all, so being able to shift out all the bits in a 16-bit register was perhaps a relevant backwards-compat issue for the 186. But for new extensions like 386 and AMD64 there was no compat question, so they could just define the semantics in a way that allowed a narrower barrel shifter with no extra condition.)
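
To make the masking rule concrete, here is a minimal C sketch of what `shl` with a count in CL computes on 186 and later. The function names are made up for illustration, and the model covers only the result value, not the flags.

```c
#include <stdint.h>

/* Hypothetical model of SHL r16, CL: the count is masked to 5 bits
   (0..31) even though the operand is only 16 bits wide, so counts
   16..31 still shift every bit out. */
static uint16_t x86_shl16(uint16_t x, uint8_t cl)
{
    unsigned count = cl & 31u;
    return (uint16_t)((uint32_t)x << count);
}

/* Hypothetical model of SHL r32, CL: the 5-bit mask equals the
   operand size, so the count is effectively taken mod 32. */
static uint32_t x86_shl32(uint32_t x, uint8_t cl)
{
    return x << (cl & 31u);
}

/* Hypothetical model of SHL r64, CL: the count is masked to 6 bits. */
static uint64_t x86_shl64(uint64_t x, uint8_t cl)
{
    return x << (cl & 63u);
}
```

With these definitions, a 16-bit shift by 16 yields 0, while a 32-bit shift by 32 leaves the value unchanged.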

As Alex mentions, x86 SIMD shifts saturate the shift count, shifting out all the bits when the count is >= the element size. For example, `pslld xmm0, xmm1` zeroes every element when the count in `xmm1` is 32 or more. (And yes, they look at the full low 64 bits of the source register, not just the low byte.)
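
A portable C sketch of that saturating behaviour for one 32-bit lane, assuming the count is read from the low 64 bits of the count register as Intel documents for `pslld`. The name `pslld_lane` is hypothetical; this is a model rather than an intrinsic.

```c
#include <stdint.h>

/* Hypothetical model of one 32-bit lane of pslld xmm0, xmm1.
   `count` stands for the low 64 bits of the count register.
   Any count above 31 clears the lane instead of wrapping. */
static uint32_t pslld_lane(uint32_t lane, uint64_t count)
{
    return count > 31 ? 0u : lane << count;
}
```

A count of 256 has a zero low byte, yet the lane is still cleared, which is the difference from a design that only inspects the low byte.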

ARM uses all the bits in the low byte of the shift-count register for variable-count shifts, i.e. the count is reduced modulo 256 to 0..255 and then saturated to 0..32. (Fun fact: immediate shifts can encode LSR by 1..32, but LSL only by 0..31.)
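
As a rough C sketch of that ARM32 register-count rule (result value only, ignoring the carry flag; the function names are made up for illustration):

```c
#include <stdint.h>

/* Hypothetical model of ARM32 "LSL Rd, Rm, Rs": only the low byte of
   Rs is used as the count, and counts of 32 or more produce zero. */
static uint32_t arm32_lsl_reg(uint32_t x, uint32_t rs)
{
    uint32_t count = rs & 0xFFu;
    return count >= 32 ? 0u : x << count;
}

/* Same rule for a logical right shift by register. */
static uint32_t arm32_lsr_reg(uint32_t x, uint32_t rs)
{
    uint32_t count = rs & 0xFFu;
    return count >= 32 ? 0u : x >> count;
}
```

So a count register holding 32 or 255 gives 0, but 256 behaves like a shift by 0 because only the low byte is examined.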

IDK what other ISAs do; there might be some that are like x86 SIMD shifts.

answered Nov 04 '21 at 06:46

Peter Cordes

The "low byte" behavior is for ARM32. It appears from the manual that ARM64 switched to "mod 32 / 64" like x86. Haven't tested it, though. Kind of weird - is there anybody in the world who actually *wants* the masking behavior? And wouldn't saturation only require like three more gates? (If any of bits 6-63 is set then return 0.) – Nate Eldredge Nov 04 '21 at 14:31
@NateEldredge: The only advantage I know of to masking is that it makes it easy to write UB-free C that matches the asm behaviour for any input. e.g. `x << (n&31)` can optimize to just an `shlx`. Relevant for expressing rotates where the opposite-direction shift would be out of range without masking, although on ISAs that actually have rotates that can all optimize away. [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/q/776508) – Peter Cordes Nov 04 '21 at 20:23
Also interestingly, it looks like ARM64's SIMD shifts (`USHL` and friends) are back to the ARM32 behavior, where the low byte is used and shifts larger than word size yield zero. There is also a form, `UQSHL`, that is truly saturating: if you shift a 1 bit off the left end (i.e. the "multiplication" overflowed), you get the maximum unsigned integer value (all 1 bits). – Nate Eldredge Nov 04 '21 at 20:32
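
The masked-count rotate idiom mentioned in the comments above can be sketched in C as follows. This is one common way to write it, in the spirit of the linked rotate question; `rotl32` is just an illustrative name, and whether it compiles to a single rotate instruction depends on the compiler and target.

```c
#include <stdint.h>

/* UB-free 32-bit rotate left: both shift counts are masked to 0..31,
   so neither shift is ever out of range, even for n == 0 or n >= 32.
   Negating an unsigned value is well defined (it wraps modulo 2^N). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << (n & 31u)) | (x >> (-n & 31u));
}
```

On an ISA whose scalar shifts already mask the count, the `& 31` operations cost nothing; on a saturating ISA the compiler has to keep them unless it recognizes the whole expression as a rotate.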

score 0 · Answer 2 · edited Nov 04 '21 at 06:47

x86 has non-modulo shifts as well, if you consider the MMX/SSE2/AVX2 shifts such as `psrlq` and `psllw`.

edited Nov 04 '21 at 06:47

Peter Cordes

answered Nov 04 '21 at 06:25

Alex Guteniev
