85

I don't want to optimize anything, I swear, I just want to ask this question out of curiosity. I know that on most hardware there's an assembly instruction for bit-shifting (e.g. shl, shr), which is a single instruction. But does it matter (nanosecond-wise, or CPU-cycle-wise) how many bits you shift? In other words, is either of the following faster on any CPU?

x << 1;

and

x << 10;

And please don't hate me for this question. :)

Armen Tsirunyan
  • 130,161
  • 59
  • 324
  • 434
  • 1
As a sidenote: I recently noticed that shifting left and shifting right do not necessarily consume the same CPU time. In my case, shifting right was much slower. First I was surprised, but I think the answer is that shifting left means logical shift while shifting right may mean arithmetic shift: http://stackoverflow.com/questions/141525/absolute-beginners-guide-to-bit-shifting – Christian Ammer Nov 23 '10 at 20:47

9 Answers

86

Potentially depends on the CPU.

However, all modern CPUs (x86, ARM) use a "barrel shifter" -- a hardware module specifically designed to perform arbitrary shifts in constant time.

So the bottom line is... no. No difference.

nimrodm
  • 23,081
  • 7
  • 58
  • 59
  • 12
    Errr - VERY MUCH depends on the processor. On some processors this is constant time. On others it can be one cycle per shift (I once used a shift by about 60,000 places as a way of s/w measuring the processor clock speed). And on other processors, there may only be instructions for single bit shifts in which case a multi-bit shift is delegated to a library routine which sits in a loop iterating away. – quickly_now Nov 20 '10 at 23:57
  • 5
    @quickly_now: That sure is a bad way of measuring clock speed. No processor is stupid enough to actually do 60,000 shifts; that'll simply be converted to `60000 mod register_size`. For example, a 32-bit processor will just use the 5 least significant bits of the shift count. – casablanca Nov 21 '10 at 01:13
  • 4
The Inmos transputer had a shift operator that took the number of shifts as a 32-bit operand. You could do 4 billion shifts if you wanted to, at 1 clock each. "No processor is stupid enough". Sorry - wrong. This one did. You DID need to code that part in assembler, though. The compilers did a sensible modification / optimisation (just set the result to 0, don't do anything). – quickly_now Nov 21 '10 at 01:51
  • 1
    The 80386 had a barrel shifter for all of the shifts where the shifted in value did not come from the carry flag, but would run the rotate-through-carry instructions as an iterated number of steps. The maximum value would have been 255, however. I wouldn't be surprised if all 80x86 processors behave at least somewhat similarly, though they might perform more than one bit at a time. Given that the rotate-through-carry can do up to 255 shifts, and the behavior is mod-33 rather than mod-32, I would not expect single-cycle behavior. – supercat Nov 21 '10 at 05:52
  • 5
    Pentium 4 lost the barrel shifter, sadly, which contributed to its overall poor instructions-per-clock rate. I assume the Core Blah architecture got it back. – Russell Borogove Nov 21 '10 at 06:45
  • @RussellBorogove: that's to be expected, because P4 is NetBurst while Core is P6, which is older, so it fell back on good old solid ground. – v.oddou Dec 19 '13 at 01:47
  • 1
    @RussellBorogove: On P4, `shl` performance was still independent of the count, just slow (4 cycle latency, or 1c on Prescott: http://agner.org/optimize/). But a smart compiler would left shift by 1 using an `add x,x` (0.5c latency on the double-pumped ALUs before Prescott). Right shift by 1 doesn't have any such shortcut, though. – Peter Cordes Aug 29 '17 at 01:39
65

Some embedded processors only have a "shift-by-one" instruction. On such processors, the compiler would change x << 3 into ((x << 1) << 1) << 1.

I think the Motorola MC68HCxx family was one of the more popular ones with this limitation. Fortunately, such architectures are now quite rare; most now include a barrel shifter with a variable shift size.

The Intel 8051, which has many modern derivatives, also cannot shift an arbitrary number of bits.

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • 12
    Still common on embedded microcontrollers. – Ben Jackson Nov 20 '10 at 18:59
  • 4
What do you mean by "rare"? According to statistics, the number of 8-bit microcontrollers sold is greater than the number of all other types of MPUs. – Vovanium Nov 20 '10 at 19:25
  • 8-bit microcontrollers are not being used much for new development, when you can get 16-bit for the same price per unit (e.g. MSP430 from TI) with more program ROM, more working RAM, and more capability. And even some 8-bit microcontrollers have barrel shifters. – Ben Voigt Nov 20 '10 at 21:10
  • 1
The word size of a microcontroller has nothing to do with whether it has a barrel shifter; the MC68HCxx family I mentioned has 16-bit processors as well, and all of them shift only a single bit position at once. – Ben Voigt Nov 20 '10 at 21:16
The fact is that most 8-bit MCUs have no barrel shifter, although you're right that there are some for which that is not true, and there are non-8-bit parts without a barrel shifter. Bitness serves as a reliable approximation for machines with[out] a barrel shifter. Also, for an MCU it is often not the CPU core that drives the choice of model, but the on-chip peripherals, and 8-bit parts are often chosen for having richer peripherals at the same price. – Vovanium Nov 20 '10 at 21:36
  • I suspect that the existence of a barrel shifter tends to correlate with register width because the cost of not having one is much worse on a 32-bit processor than on an 8-bit one. – supercat Nov 20 '10 at 21:53
  • 8-bit microcontrollers are still commonly used, mainly PIC and AVR in Arduino, which don't have barrel shifters. But they (and also 8051) have nibble swap instructions, allowing [optimizations for high bitshift counts](https://stackoverflow.com/a/17908514/995714) – phuclv Sep 13 '18 at 09:46
  • [8086 and 80286](https://stackoverflow.com/a/61034866/995714) don't have a barrel shifter either – phuclv Apr 07 '20 at 01:45
29

There are several cases to consider.

  1. Many high-speed MPUs have a barrel shifter, a multiplexer-like electronic circuit that performs any shift in constant time.

  2. If the MPU has only a 1-bit shift, x << 10 would normally be slower, as it is mostly done by 10 single-bit shifts, or by byte copying plus 2 shifts.

  3. But there is a known common case where x << 10 would be even faster than x << 1. If x is 16 bits, only the lower 6 bits of it matter (all the others will be shifted out), so the MPU needs to load only the lower byte, making just a single access cycle to 8-bit memory, whereas x << 1 needs two access cycles. If an access cycle is slower than a shift (plus clearing the lower byte), x << 10 will be faster. This may apply to microcontrollers with fast on-board program ROM while accessing slow external data RAM.

  4. In addition to case 3, the compiler may take the number of significant bits of x << 10 into account and optimize further operations down to lower-width ones, such as replacing a 16×16 multiplication with a 16×8 one (since the lower byte is always zero).

Note that some microcontrollers have no shift-left instruction at all; they use add x,x instead.

Vovanium
  • 3,798
  • 17
  • 23
I don't get it. Why would x << 10 be faster than x << 8? With x << 8 you only need to load the lower byte of the 16 bits, and not do a load plus two shifts. – none Nov 24 '10 at 09:34
  • 3
    @none: I did not state that x<<10 is faster than x<<8. – Vovanium Nov 24 '10 at 10:19
9

On ARM, this can be done as a side effect of another instruction. So potentially, there's no latency at all for either of them.

onemasse
  • 6,514
  • 8
  • 32
  • 37
  • 1
    Do the instructions execute in the same number of cycles? On a few architectures the same instruction will translate into a few different op-codes based on the operands, and take anywhere from 1 to 5 cycles. – Nick T Nov 20 '10 at 19:53
@Nick An ARM instruction generally takes 1 or 2 cycles. Not sure about the newer architectures. – onemasse Nov 20 '10 at 20:01
  • 2
@Nick T: He's speaking about ARM, which has shifts not as dedicated instructions, but as a 'feature' of many data-processing instructions. E.g. `ADD R0, R1, R2, ASL #3` adds R1 and R2 shifted 3 bits left. – Vovanium Nov 20 '10 at 20:01
9

Here's my favorite CPU, in which x<<2 takes twice as long as x<<1 :)

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
  • unfortunately it doesn't have a nibble swap instruction like 8051, PIC or AVR, therefore the [optimization trick](https://stackoverflow.com/a/17908514/995714) can't be used – phuclv Sep 13 '18 at 09:47
7

That depends both on the CPU and compiler. Even if the underlying CPU has arbitrary bit shift with a barrel shifter, this will only happen if the compiler takes advantage of that resource.

Keep in mind that shifting by a negative amount, or by an amount greater than or equal to the width in bits of the data, is "undefined behavior" in C and C++. Right shift of signed data is also "implementation defined". Rather than worrying too much about speed, be concerned that you are getting the same answer on different implementations.

Quoting from ANSI C section 3.3.7:

3.3.7 Bitwise shift operators

Syntax

      shift-expression:
              additive-expression
              shift-expression <<  additive-expression
              shift-expression >>  additive-expression

Constraints

Each of the operands shall have integral type.

Semantics

The integral promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width in bits of the promoted left operand, the behavior is undefined.

The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 multiplied by the quantity, 2 raised to the power E2, reduced modulo ULONG_MAX+1 if E1 has type unsigned long, UINT_MAX+1 otherwise. (The constants ULONG_MAX and UINT_MAX are defined in the header <limits.h>.)

The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity, 2 raised to the power E2 . If E1 has a signed type and a negative value, the resulting value is implementation-defined.

So:

x = y << z;

"<<": y × 2^z (undefined behavior if an overflow occurs);

x = y >> z;

">>": implementation-defined for signed (most often the result of an arithmetic shift: y / 2^z).

the wolf
  • 34,510
  • 13
  • 53
  • 71
  • I don't think `1u << 100` is UB. It is just 0. – Armen Tsirunyan Nov 21 '10 at 09:01
  • @Armen Tsirunyan: A bit shift `1u << 100` as a bit shift *may* be an overflow; `1u << 100` as arithmetic shift is 0. Under ANSI C, `<<` is a bit shift. http://en.wikipedia.org/wiki/Arithmetic_shift – the wolf Nov 21 '10 at 19:32
  • 2
    @Armen Tsirunyan: See ANSI section 3.3.7 -- **If the value of the right operand is negative or is greater than or equal to the width in bits of the promoted left operand, the behavior is undefined.** So your example is UB on any ANSI C system unless there is a 101+ bit type. – the wolf Nov 22 '10 at 00:31
  • @carrot-pot: OK, you convinced me :) – Armen Tsirunyan Nov 22 '10 at 09:45
  • Related: `x << (y & 31)` can still compile to a single shift instruction with no AND instruction, if the compiler knows the target architecture's shift instruction masks the count (like x86 does). (Preferably don't hard-code the mask; get it from `CHAR_BIT * sizeof(x) - 1` or something.) This is useful for writing a rotate idiom that compiles to a single instruction without any C UB regardless of inputs. (https://stackoverflow.com/questions/776508/best-practices-for-circular-shift-rotate-operations-in-c). – Peter Cordes Aug 29 '17 at 01:19
7

It is conceivable that, on an 8-bit processor, x<<1 could actually be much slower than x<<10 for a 16-bit value.

For example a reasonable translation of x<<1 may be:

byte1 = (byte1 << 1) | (byte2 >> 7)
byte2 = (byte2 << 1)

whereas x<<10 would be simpler:

byte1 = (byte2 << 2)
byte2 = 0

Notice how x<<1 shifts more often, and even farther, than x<<10. Furthermore, the result of x<<10 doesn't depend on the content of byte1, which could speed up the operation further.

Robert
  • 2,603
  • 26
  • 25
5

On some generations of Intel CPUs (P2 or P3? Not AMD though, if I remember right), the bit-shift operations are ridiculously slow. Shifting by 1 bit should always be fast, though, since it can just use addition. Another question to consider is whether bit-shifts by a constant number of bits are faster than variable-length shifts. Even if the opcodes are the same speed, on x86 the nonconstant right-hand operand of a bit-shift must occupy the CL register, which imposes additional constraints on register allocation and may slow the program down that way too.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • 1
    That's Pentium 4. PPro-derived CPUs (like P2 and P3) have fast shifts. And yes, variable-count shifts on x86 are slower than they could be, unless you can use BMI2 `shlx` / `shrx` / `sarx` (Haswell and later, and Ryzen). CISC semantics (flags unmodified if count=0) hurt x86 here. `shl r32, cl` is 3 uops on Sandybridge-family (although Intel claims it can cancel one of the uops if the flag result is unused). AMD has single-uop `shl r32, cl` (but slow double-shift for extended-precision, `shld r32, r32, cl`) – Peter Cordes Aug 29 '17 at 01:27
  • 1
    Shifts (even variable-count) are only a single uop on P6-family, but reading the flag-result of `shl r32, cl` or with an immediate other than 1 stalls the *front-end* until the shift retires! (https://stackoverflow.com/questions/36510095/inc-instruction-vs-add-1-does-it-matter). Compilers know this, and use a separate `test` instruction instead of using the flag result of a shift. (But this wastes instructions on CPUs where it's not a problem, see https://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466) – Peter Cordes Aug 29 '17 at 01:28
3

As always, it depends on the surrounding code context: e.g. are you using x<<1 as an array index, or adding it to something else? Either way, small shift counts (1 or 2) can often optimize away even more than if the compiler ends up having to emit an actual shift. Not to mention the whole throughput vs. latency vs. front-end-bottleneck tradeoff: performance of a tiny fragment is not one-dimensional.

A hardware shift instruction is not a compiler's only option for compiling x<<1, but the other answers mostly assume it is.


x << 1 is exactly equivalent to x+x for unsigned, and for 2's complement signed integers. Compilers always know what hardware they're targeting while they're compiling, so they can take advantage of tricks like this.

On Intel Haswell, add has 4 per clock throughput, but shl with an immediate count has only 2 per clock throughput. (See http://agner.org/optimize/ for instruction tables, and other links in the tag wiki). SIMD vector shifts are 1 per clock (2 in Skylake), but SIMD vector integer adds are 2 per clock (3 in Skylake). Latency is the same, though: 1 cycle.

There's also a special shift-by-one encoding of shl where the count is implicit in the opcode. 8086 didn't have immediate-count shifts, only shift-by-one and shift-by-cl. This is mostly relevant for right-shifts, because you can just add for left shifts, unless you're shifting a memory operand. But if the value is needed later, it's better to load it into a register first. Either way, shl eax,1 or add eax,eax is one byte shorter than shl eax,10, and code size can directly (decode / front-end bottlenecks) or indirectly (L1I code-cache misses) affect performance.

More generally, small shift counts can sometimes be optimized into a scaled index in an addressing mode on x86. Most other architectures in common use these days are RISC and don't have scaled-index addressing modes, but x86 is a common enough architecture for this to be worth mentioning. (E.g. if you're indexing an array of 4-byte elements, there's room to increase the scale factor by 1 for int arr[]; arr[x<<1].)


Needing to copy+shift is common in situations where the original value of x is still needed. But most x86 integer instructions operate in-place. (The destination is one of the sources for instructions like add or shl.) The x86-64 System V calling convention passes args in registers, with the first arg in edi and return value in eax, so a function that returns x<<10 also makes the compiler emit copy+shift code.

The LEA instruction lets you shift-and-add (with a shift count of 0 to 3, because it uses addressing-mode machine-encoding). It puts the result in a separate register.

gcc and clang both optimize these functions the same way, as you can see on the Godbolt compiler explorer:

int shl1(int x) { return x<<1; }
    lea     eax, [rdi+rdi]   # 1 cycle latency, 1 uop
    ret

int shl2(int x) { return x<<2; }
    lea     eax, [4*rdi]    # longer encoding: needs a disp32 of 0 because there's no base register, only scaled-index.
    ret

int times5(int x) { return x * 5; }
    lea     eax, [rdi + 4*rdi]
    ret

int shl10(int x) { return x<<10; }
    mov     eax, edi         # 1 uop, 0 or 1 cycle latency
    shl     eax, 10          # 1 uop, 1 cycle latency
    ret

LEA with 2 components has 1 cycle latency and 2-per-clock throughput on recent Intel and AMD CPUs. (Sandybridge-family and Bulldozer/Ryzen). On Intel, it's only 1 per clock throughput with 3c latency for lea eax, [rdi + rsi + 123]. (Related: Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? goes into this in detail.)

Anyway, copy+shift by 10 needs a separate mov instruction. It might be zero latency on many recent CPUs, but it still takes front-end bandwidth and code size. (Can x86's MOV really be "free"? Why can't I reproduce this at all?)

Also related: How to multiply a register by 37 using only 2 consecutive leal instructions in x86?.


The compiler is also free to transform the surrounding code so there isn't an actual shift, or it's combined with other operations.

For example if(x<<1) { } could use an and to check all bits except the high bit. On x86, you'd use a test instruction, like test eax, 0x7fffffff / jz .false instead of shl eax,1 / jz. This optimization works for any shift count, and it also works on machines where large-count shifts are slow (like Pentium 4), or non-existent (some micro-controllers).

Many ISAs have bit-manipulation instructions beyond just shifting. e.g. PowerPC has a lot of bit-field extract / insert instructions. Or ARM has shifts of source operands as part of any other instruction. (So shift/rotate instructions are just a special form of move, using a shifted source.)

Remember, C is not assembly language. Always look at optimized compiler output when you're tuning your source code to compile efficiently.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Correction: P4 wasn't slower for *large* shift counts, it was just slow at shifts in general, 4c latency but still single uop for immediate or implicit-1 shifts. No performance dependence on the count. Also, Prescott improved that to 1c latency for immediate shifts of 32-bit registers, but 7 cycle latency for 64-bit shifts :/ – Peter Cordes Apr 14 '21 at 02:59