Is logical/arithmetic shift fewer bits faster?

Question

Is x>>2 faster than x>>31? In other words, is sar x, 2 faster than sar x, 31? I did some simple test, they seem to have the same speed. I would appreciate any solid evidence.

The answer to this question may depend on the specific architecture being targeted. You should tag the question with whichever architecture you are asking about. For example `x86`. — François Andrieux, Apr 13 '21 at 20:06
For most processors they will be exactly the same speed. But if you're concerned you should run some benchmarks yourself. — Mark Ransom, Apr 13 '21 at 20:07
Depends on the internal architecture of the processor. It may have a *barrel shifter* which can shift many bits many places. Also depends on when the clock pulse occurs. For example, if there is only one bit shit per clock or whether the processor can shift many bits in a clock cycle. — Thomas Matthews, Apr 13 '21 at 20:11
You're looking at microoptimization. The savings may be in terms of nanoseconds, or 100s of nanoseconds. Ask yourself if this is significant. — Thomas Matthews, Apr 13 '21 at 20:13
Thank you for the fast replies. So it is never slower, right? Like Speed of (sar x, 2 ) <= Speed of ( sar x, 31 ). — SheldonCao, Apr 13 '21 at 20:14
@ThomasMatthews when is the last time you saw a processor without a barrel shifter? They were common at least 30 years ago. — Mark Ransom, Apr 13 '21 at 20:14
I recall that the 8086 took time for each bit shifted, and so did the 186. The 186 masked shift counts with 31 to limit the duration of handling a single instruction. — ecm, Apr 13 '21 at 20:15
x>>31 could be faster on some ancient processors because it only has to check one bit. E.g. shift left with carry; set register to 0; subtract with borrow. x>>31 could be slower on some ancient processors because it has to shift right by 1 bit, 31 separate times. — user253751, Apr 13 '21 at 20:15
@MarkRansom: Remember, the 8051 and its brethren are still in use. :-) There may be some custom micros out there. I can't say that all micros have barrel shifters. :-( — Thomas Matthews, Apr 13 '21 at 20:17
@ThomasMatthews • your typo made me laugh out loud! I needed a good chuckle, thank you. — Eljay, Apr 13 '21 at 20:35
@ThomasMatthews: 100 ns is 400 clock cycles on a modern x86 (at 4GHz), enough time to run about 1600 instructions (if it's not stalling on any memory or latency bottlenecks). If just one instruction could be that slow, you'd surely have heard about it as being vastly worse than even division. Like worst case for x87 `FYL2X` (`y * log2(x)`) on Haswell is 680 cycles, but best case is 58 cycles. (https://agner.org/optimize/). 64-bit Integer div is really slow, and that's 21 to 74 cycle throughput, vs. 8 or 9 cycle for 32-bit integer division, or FP div / sqrt. — Peter Cordes, Apr 13 '21 at 22:43
@ThomasMatthews barrel shifter is very large compared to a shift register so no microcontrollers I know have barrel shifter: [Is a logical right shift by a power of 2 faster in AVR?](https://stackoverflow.com/a/17908514), [Bitshifts in MSP430](https://stackoverflow.com/a/63886220) — phuclv, Apr 14 '21 at 00:55
@Peter Cordes: That's incorrect. Shift count masking is used to detect 186 level support. This happens to fail for the NEC V30 but other 186s do mask it. [Here's a source](https://stackoverflow.com/questions/61745808/why-any-modern-x86-masks-shift-count-to-the-5-low-bits-in-cl/61749420#61749420) where the following document is referenced: "80186/80188 80C186/80C188 Hardware Reference Manual" — ecm, Apr 14 '21 at 07:18
@ecm: xD, my own answer on that question starts with "Despite what Intel's current manuals say, masking the shift count was new in 186," and cites the same 186-detection method. :P Deleted that previous comment linking https://www.felixcloutier.com/x86/sal:sar:shl:shr#ia-32-architecture-compatibility. Perhaps Intel doesn't count 186 as part of the mainstream x86 series. The phrasing is already suspicious "all other IA-32 processors (starting with the Intel 286 processor)" - so 286 is an IA-**32**? — Peter Cordes, Apr 14 '21 at 07:28

score 7 · Accepted Answer · answered Apr 13 '21 at 21:47

7

It would depend on the hardware implementation. For common operations involving a shift by a constant amount (e.g. pointer arithmetic), there could be a faster path (e.g. it might be fused with a related addition operation). For shifting by a variable, a barrel shifter circuitry is used, where any shift amount would have the same delay.

answered Apr 13 '21 at 21:47

mkayaalp

2,631
10
17

2

https://uops.info/ and https://agner.org/optimize/ have numbers for actual x86 CPU instructions. Pentium 4 notoriously had slow shifts, but still fixed latency/throughput (not data-dependent). Most CPUs have 1-cycle latency for any shift count. (On modern Intel CPUs, compile-time-constant shifts are great, but when the count is a runtime variable, `shr reg, cl` decodes to 3 uops [because of x86 legacy baggage with not updating FLAGS if the count was 0](https://stackoverflow.com/a/36510865/224132). Unless you let the compiler use BMI2 `shlx` / `shrx`. Still, latency is only 1 cycle.) – Peter Cordes Apr 13 '21 at 22:48
2

Intel as early as 386SX had a barrel shifter: https://media.digikey.com/pdf/Data%20Sheets/Intel%20PDFs/Intel386%20SX.pdf#page=84 lists cycle counts for shift/rotate of a register as 3 cycles for shift-by-1, shift-by-CL, or shift-by-immediate. (vs. 2 cycles for an instruction like `add reg,reg`). The last Intel x86 to have shift performance that depended on the count seems to be 286: https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has a table for 8088 .. Pentium. **8088 was 8 + 4n, 186 and 286 were 5 + n. 386 was a fixed 3 cycles**. – Peter Cordes Apr 14 '21 at 03:01

Is logical/arithmetic shift fewer bits faster?

1 Answers1