0

Some CPUs like the MSP430 don't have multi-bit shifts, but only single-bit shifts or rotate instructions. This makes me curious about how programmers in the "olden days" implemented multi-bit shifts, when all they could do is shift one bit at a time.

I am aware of a "dumb" way of doing it, which is this:

#include <cstdint>

uint64_t lshift(uint64_t x, uint64_t shift) {
    for (uint64_t i = 0; i < shift; ++i) {
        x <<= 1;
    }
}

Is there any way of doing it which does not have O(n) complexity? Or is there at least an implementation that makes it possible if I know the shift at compile time, which is often the case with bit-shifts?

My intuition is that x << (1 << (1 << 1)) is the same as x << 4, so maybe one could reduce it down to O(log n) by combining shifts like that.

Edit

My intuition was wrong, but other operations could produce a similar effect. x << 1 is equivalent to x += x so x += x, x += x, x += x is equivalent to x << 4. Multiplication with powers of two would also work.


Note: C++ is only being used for the sake of convenience here, I know that there will always be a left-shift operator. I just don't want to think about this in MP430 assembly.

Jan Schultke
  • 17,446
  • 6
  • 47
  • 96
  • If it does not support multi-bit shifts, I am pretty sure it ain't possible to do so. This is like multithreading on a single core CPU – kesarling He-Him Sep 04 '20 at 15:19
  • No. You can't shift N bits left without N single-bit shifts. – user253751 Sep 04 '20 at 15:19
  • 2
    It's unlikely if you don't have a multi bit shift shift, but do you have a multiply? – Colin Sep 04 '20 at 15:20
  • @Colin if I had a multiply, would that change anything? Multiplying by 8 gives me a 3-bit shift, but how do I go from "I have to shift 3 bits" to "I have an 8". Usually that requires (1 << 3), which is not allowed. – Jan Schultke Sep 04 '20 at 15:22
  • I don't see the usefulness of dealing with these hypotheticals, but you could always rewrite a left bit shift into a multiplication by the proper power of two. If the number to shift by isn't a constant, you could look up the corresponding power of two in a table. Otherwise it's `(1ull<<(SHIFT))` (an integer constant expression, foldable at compile time). – Petr Skocik Sep 04 '20 at 15:23
  • @JanSchultke have a look up table of shift amount to multiplier, something like that – Colin Sep 04 '20 at 15:23
  • 4
    You are overthink this. C++ and C is defined as an abstract machine on which bit shift operator is well defined. How this is done on specific platform (like MSP430) is compiler responsibility. You do not have to do anything on C/C++ level. The only case when you have to do it by own code is when you do this on type which is not build in. – Marek R Sep 04 '20 at 15:26
  • @JanSchultke in last edit you should specify this in tittle or top of the question that C/C++ here is just way to show effective algorithm how this is done on MSP430 and C/C++ is just abstract langue to avoid assembly (most probably you should remove C++ tag). Anyway take look how this is done by gcc MSP430 https://godbolt.org/z/6Wc4YT Note that this is so complex problem that there is a predefined function for that which is not inlined even for `-O3`. – Marek R Sep 04 '20 at 15:40
  • @MarekR I would be curious to know how that builtin function is really implemented. – Jan Schultke Sep 04 '20 at 15:46
  • [Google search has only](https://www.google.com/search?q="_ashldi3"+MSP430) 16 results. So far I've failed to find actual implementation. – Marek R Sep 04 '20 at 15:52
  • You could unroll the inner loop. This would eliminate an increment instruction, a comparison instruction and a branch instruction. Worst case, 31 shifts. – Thomas Matthews Sep 04 '20 at 16:13
  • Your intuition is off. For one thing, `(1 << 1)` is `2`, not `4`. Secondly, what is `x << (1 << 1)` supposed to mean when you have restricted yourself to the case where the right-side operator to `<<` must be `1`? Based on your restrictions, `x << (1 << 1)`, which evaluates to `x << 2`, is illegal, no? – JaMiT Sep 04 '20 at 16:59
  • @JaMiT yeah, I goofed that one up. I fixed it so the expression is now equivalent to `x << 4`, but it still doesn't change the other problem you addressed. – Jan Schultke Sep 04 '20 at 17:50
  • @MarekR it'll be inlined if the shift count is constant – phuclv Sep 14 '20 at 22:43
  • @phuclv it is not https://godbolt.org/z/E7zr3P Maybe you missed that it is about MP430 platform. – Marek R Sep 14 '20 at 23:19
  • @MarekR no **it is**. See my answer. In your link it isn't inlined because **you used `uint64_t`** but MSP430 is a 16-bit MCU so 64-bit operations are far more complex for it and has to be done via a library call – phuclv Sep 14 '20 at 23:22

3 Answers3

2

For background information about the following code, search the internet for "Duff's Device".

You could use a switch statement with fall through:

uint32_t Shift_Value(uint32_t value, unsigned int shift_quantity)
{
  switch (shift_quantity)
  {
     case 31:
         value <<= 1;
     case 30:
         value <<= 1;
     case 29:
         value <<= 1;
// ...
     case  1:
         value <<= 1;
   }
   return value;
}

The above code is interesting because it is a jump table into an array of shift operations. It can be compared to unrolling a for loop, but it has an advantage that the execution jumps into the appropriate location for the "unrolling".

I've used this pattern before in embedded systems to increase performance.

I recommend printing out the compiler generated assembly language and studying the assembly language. :-)

Also, the optimization could be O(1) since there are no loops, only a calculation and jump.

Thomas Matthews
  • 56,849
  • 17
  • 98
  • 154
2

TL;DR: Indeed shifts by multiple steps would generally be done by multiple shifts as you can imagine. But some tricks can be used to avoid shifting too many times. For example some algorithms are designed so that only shifts by 1 is needed, or if a bigger shift is required then some special bitwise instructions in the ISA can be used for optimization

Programming for an embedded system requires deeper understanding about that architecture in order to achieve good performance and RAM/ROM usage. For example variables are chosen so that they fit in the machine word. No one would use uint64_t like that on a 16 or 8-bit MCU unless absolutely necessary. Operations on multi-word variables (include bit shifts) require much more instructions so they won't usually be inlined. In general for shifts that are a multiple of a word size are done similarly to shifting array elements so they'll be very fast


Shifting by an arbitrary variable requires a barrel shifter which takes far more die space than a simple shift register, therefore most embedded architectures can only shift one bit at a time. Most of them also include some kind of swap halfword instruction to overcome the limitation and allow fast enough large-distance shifting. For example 8-bit microcontrollers like 8051, PIC or AVR has a swap-nibble instruction. See:

The MSP430 is a 16-bit MCU so it has a SWPB instruction to swap bytes which can be used similarly for fast shifting by 8. Here are some examples generated by Clang (with comments by me, notice how the shift by 8 and shifts larger than 8 are done):

shift_left_15(unsigned short):                     ; @shift_left_15(unsigned short)
        mov.b   r12, r12
        swpb    r12      ; swap bytes then shift left 7 times
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        ret
shift_left_12(unsigned short):                     ; @shift_left_12(unsigned short)
        mov.b   r12, r12
        swpb    r12      ; swap bytes then shift left 4 times
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        ret
shift_left_10(unsigned short):                     ; @shift_left_10(unsigned short)
        mov.b   r12, r12
        swpb    r12      ; swap bytes then shift left 2 times
        add     r12, r12
        add     r12, r12
        ret
shift_left_9(unsigned short):                      ; @shift_left_9(unsigned short)
        mov.b   r12, r12
        swpb    r12
        add     r12, r12
        ret
shift_left_8(unsigned short):                      ; @shift_left_8(unsigned short)
        mov.b   r12, r12
        swpb    r12      ; just swap bytes
        ret
shift_left_7(unsigned short):                      ; @shift_left_7(unsigned short)
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        add     r12, r12
        ret
shift_left_3(unsigned short):                      ; @shift_left_3(unsigned short)
        add     r12, r12
        add     r12, r12
        add     r12, r12
        ret

You can open the Godbolt link above for the full output

If you're on MSP430X then it has the ability to shift from 1 to 4 bit position which greatly simplifies the shifting procedure

shift_left_15(unsigned short):
        PUSHM.W #1, R4
        MOV.W   R1, R4
        rpt     #15 { rlax.w      R12
        POPM.W  #1, r4
        RET
shift_left_12(unsigned short):
        PUSHM.W #1, R4
        MOV.W   R1, R4
        rpt     #12 { rlax.w      R12
        POPM.W  #1, r4
        RET
shift_left_10(unsigned short):
        PUSHM.W #1, R4
        MOV.W   R1, R4
        rpt     #10 { rlax.w      R12
        POPM.W  #1, r4
        RET
shift_left_9(unsigned short):
        PUSHM.W #1, R4
        MOV.W   R1, R4
        rpt     #9 { rlax.w       R12
        POPM.W  #1, r4
        RET
shift_left_8(unsigned short):
        PUSHM.W #1, R4
        MOV.W   R1, R4
        rpt     #8 { rlax.w       R12
        POPM.W  #1, r4
        RET
shift_left_7(unsigned short):
        PUSHM.W #1, R4
        MOV.W   R1, R4
        rpt     #7 { rlax.w       R12
        POPM.W  #1, r4
        RET
shift_left_3(unsigned short):
        PUSHM.W #1, R4
        MOV.W   R1, R4
        rpt     #3 { rlax.w       R12
        POPM.W  #1, r4
        RET

Right shift can be done in the same way but replace add with rrc/rra and rlax with rrax. See demo on Godbolt

phuclv
  • 37,963
  • 15
  • 156
  • 475
1

If you have a multiplier, then

uint32_t multipliers[] = {1,2,4,8,16 ...};
uint32_t shift(uint32_t x, uint32_t shift)
{
    return x * multipliers[shift];
}
Colin
  • 3,394
  • 1
  • 21
  • 29
  • Will that be a single instruction or will it still boil down to n single bit shifts? – User 10482 Sep 04 '20 at 15:30
  • That would depend on whether you have a hardware multiplier, and how the compiler works it out, it looks like some MSP430 variants do indeed have a multiplier as a peripheral, so I would hope a decent compiler would use it rather than the limited single bit shift – Colin Sep 04 '20 at 15:31
  • That's really funny because this optimization is usually done in reverse – harold Sep 05 '20 at 04:18
  • this only works if there are very fast multipliers (unlikely in such embedded systems) – phuclv Sep 14 '20 at 14:30