6

I would like to know if performing a logical right shift is faster when shifting by a power of 2

For example, is

myUnsigned >> 4

any faster than

myUnsigned >> 3

I appreciate that everyone's first response will be to tell me that one shouldn't worry about tiny little things like this, it's using correct algorithms and collections to cut orders of magnitude that matters. I fully agree with you, but I am really trying to squeeze all I can out of an embedded chip (an ATMega328) - I just got a performance shift worthy of a 'woohoo!' by replacing a divide with a bit-shift, so I promise you that this does matter.

phuclv
  • 37,963
  • 15
  • 156
  • 475
Will
  • 79
  • 1
  • 3
  • 10
    Why don't you measure yourself? – Tadeusz A. Kadłubowski Sep 16 '10 at 11:48
  • 9
    Who cares if `x >> 4` is faster than `x >> 3`? They have different semantics, so it does not matter which one is faster. Anyway, I have never encountered an architecture where the right operand of a bit shift operator had any performance impact. – fredoverflow Sep 16 '10 at 11:54
  • 9
    @FredOverflow: On the ATMega, the bit-shift instruction doesn't take a "number of bits to shift" operand. Regarding `x >> 4` versus `x >> 3` -- maybe the OP has some liberties here (e.g. doing fixed-point arithmetic and has a certain amount of latitude in how large the fractional component is) – Martin B Sep 16 '10 at 11:57
  • Division is famous for being extremely expensive (roughly 40 cycles on a modern desktop processor that can do several shifts in one cycle, more than that because it is implemented in software to add insult to injury on an embedded chip). – Pascal Cuoq Sep 16 '10 at 11:58
  • @Martin: Interesting, I hadn't considered that. – fredoverflow Sep 16 '10 at 12:11
  • Joey has a good point, by the way - for an unsigned type, your compiler "should" be able to optimise for you, and to replace `myUnsigned / 16` with `myUnsigned >> 4` if that is faster. But apparently it didn't. You might find some overall performance improvements just by checking your compiler options. – Steve Jessop Sep 16 '10 at 12:17
  • 1
    @FredOverflow: I'd guess he's trying to implement some kind of fixed-point integer. In that case, he can switch between `x >> 4` and `x >> 3`, just be changing the number of fraction bits, as long as the precision/range are still good enough for the application. – Niki Sep 16 '10 at 12:18
  • Fred is entirely correct, I replaced all my instances of floating point numbers (in the critical code) with integers. These integers needed to be multiplied by some coefficient to spread ~0.0-10.0 over a large range of ints. I have control over this multiplicand, and can set it semi-arbitrarily (as long as it is large). New scam: Replacing a bit shift by 9 with a downcast (little-endian, no pointer math needed) and a bit shift of 1. – Will Sep 16 '10 at 20:46
  • Try reading the datasheet. – TomServo Feb 03 '20 at 02:19
  • @fredoverflow Same advice for you -- read the datasheet. General "never encountered an architecture" advice is less relevant than a simple look at the available opcodes available. I this case, your general knowledge is wrong. – TomServo Feb 03 '20 at 02:23

9 Answers9

21

Let's look at the datasheet:

http://atmel.com/dyn/resources/prod_documents/8271S.pdf

As far as I can see, the ASR (arithmetic shift right) always shifts by one bit and cannot take the number of bits to shift; it takes one cycle to execute. Therefore, shifting right by n bits will take n cycles. Powers of two behave just the same as any other number.

Martin B
  • 23,670
  • 6
  • 53
  • 72
  • Thank you! I had to replace a floating point number with an integer, but this had to be multiplied larger in order to keep precision. I am trying to find an ideal coefficient so that I spend the least possible time crunching the int back down to the unmultiplied size. – Will Sep 16 '10 at 12:03
5

In the AVR instruction set, arithmetic shift right and left happen one bit at a time. So, for this particular microcontroller, shifting >> n means the compiler actually makes n many individual asr ops, and I guess >>3 is one faster than >>4.

This makes the AVR fairly unsual, by the way.

Crashworks
  • 40,496
  • 12
  • 101
  • 170
  • it's not unusual. Most (if not all) 8-bit microcontrollers don't have a barrel shifter and must shift one bit at a time – phuclv Oct 08 '15 at 12:31
  • [8086 and 80286](https://stackoverflow.com/a/61034866/995714) don't have a barrel shifter either, so the farther the shift distance the slower it gets – phuclv Apr 07 '20 at 01:42
4

Indeed ATMega doesn't have a barrel shifter just like most (if not all) other 8-bit MCUs. Therefore it can only shift by 1 each time instead of any arbitrary values like more powerful CPUs. As a result shifting by 4 is theoretically slower than shifting by 3

However ATMega does have a swap nibble instruction so in fact x >> 4 is faster than x >> 3

Assuming x is an uint8_t then x >>= 3 is implemented by 3 right shifts

x >>= 1;
x >>= 1;
x >>= 1;

whereas x >>= 4 only need a swap and a bit clear

swap(x);    // swap the top and bottom nibbles AB <-> BA
x &= 0x0f;

or

x &= 0xf0;
swap(x);

For bigger cross-register shifts there are also various ways to optimize it

With a uint16_t variable y consisting of the low part y0 and high part y1 then y >> 8 is simply

y0 = y1;
y1 = 0;

Similarly y >> 9 can be optimized to

y0 = y1 >> 1;
y1 = 0;

and hence is even faster than a shift by 3 on a char


In conclusion, the shift time varies depending on the shift distance, but it's not necessarily slower for longer or non-power-of-2 values. Generally it'll take at most 3 instructions to shift within an 8-bit char

Here are some demos from compiler explorer

  • A right shift by 4 is achieved by a swap and an and like above

      swap r24
      andi r24,lo8(15)
    
  • A right shift by 3 has to be done with 3 instructions

      lsr r24
      lsr r24
      lsr r24
    

Left shifts are also optimized in the same manner

See also Which is faster: x<<1 or x<<10?

phuclv
  • 37,963
  • 15
  • 156
  • 475
  • Hm, I didn't know that x << 3 in AVR is implemented as 3 shifts. Are you sure that AVR has a special OP for a single bit shift? On ARM swap and <<3 would take the same time (1 cycle) – Mixaz Nov 10 '14 at 20:44
  • 1
    @Mixaz no 8-bit microcontroller I know has barrel shifter, so it can only shift 1 bit each cycle. Just look for AVR, PIC or 8051 instruction set and see – phuclv Nov 11 '14 at 03:55
  • Even some 16-bit microcontrollers still have to shift 1 bit each time. The instruction set of ARV was posted in the other answers, read it first – phuclv Nov 11 '14 at 04:01
4

You have to consult the documentation of your processor for this information. Even for a given instruction set, there may be different costs depending on the model. On a really small processor, shifting by one could conceivably be faster than by other values, for instance (it is the case for rotation instructions on some IA32 processors, but that's only because this instruction is so rarely produced by compilers).

According to http://atmel.com/dyn/resources/prod_documents/8271S.pdf all logical shifts are done in one cycle for the ATMega328. But of course, as pointed out in the comments, all logical shifts are by one bit. So the cost of a shift by n is n cycles in n instructions.

Pascal Cuoq
  • 79,187
  • 7
  • 161
  • 281
  • 1
    Beware: The shift instructions always shifts by only one bit... so the further you shift, the longer it takes. – Martin B Sep 16 '10 at 11:57
  • @Martin B Thanks for pointing out, I should have noticed it, the information was available in the same PDF. – Pascal Cuoq Sep 16 '10 at 12:01
  • The ATMega has a nibble swap instruction, so `Rd << 4` may be implemented as `SWAP Rd; ORI Rd, 0xF0` and will be faster than `Rd << 3` – phuclv Jun 15 '14 at 03:27
2

It depends on how the processor is built. If the processor has a barrel-rotate it can shift any number of bits in one operation, but that takes chip space and power budget. The most economical hardware would just be able to rotate right by one, with options regarding the wrap-around bit. Next would be one that could rotate by one either left or right. I can imagine a structure that would have a 1-shifter, 2-shifter, 4-shifter, etc. in which case 4 might be faster than 3.

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
2

Disassemble first then time the code. Dont be discouraged by people telling you, you are wasting your time. The knowledge you gain will put you in a position to be the goto person for putting out the big company fires. The number of people with real behind the curtain knowledge is dropping at an alarming rate in this industry.

Sounds like others explained the real answer here, which disassembly would have shown, single bit shift instruction. So 4 shifts will take 133% of the time that 3 shifts took, or 3 shifts is 75% of the time of 4 shifts depending on how you compared the numbers. And your measurements should reflect that difference, if they dont I would continue with this experiment until you completely understand the execution times.

old_timer
  • 69,149
  • 8
  • 89
  • 168
1

If your targer processor has a bit-shift instruction (which is very likely), then it depends on the hardware-implementation of that instruction if there will be any difference between shifting a power-of-2 bits, or shifting some other number. However, it is unlikely to make a difference.

Bart van Ingen Schenau
  • 15,488
  • 4
  • 32
  • 41
0

With all respect, you should not even start talking about performace until you start measuring. Compile you program with division. Run. Measure time. Repeat with shift.

danatel
  • 4,844
  • 11
  • 48
  • 62
  • 2
    Given that he's already measured a performance improvement by replacing div with shift, I think it's fairly evident that he's been running timings. – Crashworks Sep 16 '10 at 12:03
  • AFAIK it is a widely known matter about computer calculations that shift OPs are much faster than multiplication ones, division is slower than multiplication (they slower even on paper)). Addition/subtraction are almost as fast as shifts - just in theory they use a bit more transistors, but it doesn't matter and CPU executes them in single cycle anyway. multiplication and division take more cycles – Mixaz Nov 10 '14 at 20:57
  • multiplication and division take more cycles, since they use addition/subtraction internally in subsequent iterations. I recall that ARM specs (at least for old versions) stated that division (I do not remember about multiplication) may take different time, because of that – Mixaz Nov 10 '14 at 21:03
0

replacing a divide with a bit-shift

This is not the same for negative numbers:

char div2 (void)
{
    return (-1) / 2;
    // ldi r24,0
}

char asr1 (void)
{
    return (-1) >> 1;
    //  ldi r24,-1
}
emacs drives me nuts
  • 2,785
  • 13
  • 23