
I am trying to make my microcontroller library more efficient, and the most-called function in the library (it can be called 1 million times a second) is where I'm focusing my efforts.

In this function there are some bit-shift operations, like this one, where the bool variable type is converted to 16 bits and shifted 10 places:

value = ((uint16_t) type) << 10;

How many clock cycles / assembly instructions does a shift by 10 use? Is it 10 instructions, or is it done in a single instruction?

As an alternative, I am considering doing something like this instead:

if (type)
    value = 0x0400;
else
    value = 0x0000;

Now, if << 10 does take 10 operations, I would assume that the latter would be slightly faster by a few operations (the flash space trade-off is negligible).

The library is used from both C and C++, and the target microcontroller architectures are ARM and RISC.

Bonus points: are there tools for answering questions like this, where I want to see how many assembly instructions a whole function compiles into, for further optimization? Of course without digging through the compiled hex files.

fuz
Jake quin
  • Please see [Which is faster: x<<1 or x<<10?](https://stackoverflow.com/questions/4234120/which-is-faster-x1-or-x10) which is answered: *Potentially depends on the CPU. However, all modern CPUs (x86, ARM) use a "barrel shifter" -- a hardware module specifically designed to perform arbitrary shifts in constant time. So the bottom line is... no. No difference.* – Weather Vane Feb 22 '22 at 19:34
  • There are many ARM architectures. A Cortex M0 has a 2-stage pipeline. A Cortex A-15 has 15 pipeline stages. What is RISC (Reduced Instruction Set Computing)? There are lots of RISC architectures and microcontrollers from the 1960s to today. Do you mean RISC-V? Branches typically are bad, except if they can be predicted. At least rewrite to `value = type ? 0x0400 : 0x0000;`. I would not worry about the shift instruction. The number of assembly instructions != the number of needed cycles. Benchmark on real hardware or a simulator. – Sebastian Feb 22 '22 at 19:56
  • ARM architectures can integrate shifts into other assembly instructions. See e.g. http://www.csbio.unc.edu/mcmillan/Comp411F18/Lecture07.pdf (caveat the font) – Sebastian Feb 22 '22 at 19:59
  • some instruction sets can only shift one bit at a time, others can do many in one instruction. the shift can be removed for the code you showed and instead you would use an and type test instruction where you and with the bit in question then look at the zero flag (if the instruction set has flags). – old_timer Feb 22 '22 at 19:59
  • @Sebastian the ternary syntax will probably generate identical code to the `if .. else`. – Weather Vane Feb 22 '22 at 20:02
  • if you want to make it more efficient, then look at the generated code from an optimized compiler. while it is not difficult to find optimizations in compiled output, from the questions you are asking you need to start with the compiler output itself first. There are times you can change your high level code to improve the compiled output but until you understand what it is doing right now you cannot improve it. newer processors there is no, real, how many clocks thing as that is not deterministic. instead minimize memory accesses first then instructions – old_timer Feb 22 '22 at 20:02
  • You could also multiply the bool's integer representation (0 or 1) with any desired value. This only works if you get the bool value as a parameter or from memory; if you calculate the bool value locally, it would possibly be stored in some predicate. https://en.wikipedia.org/wiki/Predication_(computer_architecture)#History (last paragraph). – Sebastian Feb 22 '22 at 20:03
  • @WeatherVane In this case yes, but it forces one to formulate the if/else block similarly. – Sebastian Feb 22 '22 at 20:03
  • I'd only worry about this if your RISC ISA is AVR or something else that can't shift by multiple positions at once. – Peter Cordes Feb 22 '22 at 22:19
  • @WeatherVane: I was going to say that sometimes if-conversion to branchless only happens at `gcc -O3` not `-O2` (as in [gcc optimization flag -O3 makes code slower than -O2](https://stackoverflow.com/q/28875325)), but in this case GCC optimizes the `if()` or the ternary to a shift even at `-O1`, I guess noticing that both sides of the branch just set `value`. https://godbolt.org/z/s5nohPsq8 For that not to happen, you have to turn optimization down to `-Og` or `-O0`. And even then, both ternary and `if` still compile to the same branchy asm with GCC -Og for x86-64 or AVR. – Peter Cordes Feb 22 '22 at 22:30
  • with the arm tag, there is more than one instruction set so you need to look at the one in question – old_timer Feb 22 '22 at 23:33
  • You mention flash space tradeoff - some microcontrollers have slower speed, when running directly from Flash or ROM instead of copying the compiled code into (S)RAM and running from there. – Sebastian Feb 22 '22 at 23:54
  • risc is not an instruction set. how arm handles this and mips and risc-v and others are not assumed to be different...by design, you have to focus on one architecture if you are trying to hand tune compiled output. again, asking this question means you are not ready to outperform the compiler. just use the compiler. – old_timer Feb 23 '22 at 03:52
  • even the stm32 family that has a cache in front of the flash that you cannot disable, it is still better performance to run out of sram as flash is often at best half the speed of the processor (meaning the processor spends at least half of its time waiting on instructions) some folks for some reason think they have to run the mcu at the highest speed advertised which often causes the flash wait cycles to be much worse. the flash speed is fixed, and only recently is it approaching the processor clock speed, where sram generally is. and then there is the dont count clocks thing... – old_timer Feb 23 '22 at 03:55
  • this question is so broad it is somewhere between several separate SO questions to dozens/hundreds of separate SO questions. Please ask one question at a time, with examples. – old_timer Feb 23 '22 at 03:56
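
To illustrate the two branch-free alternatives raised in the comments above (Sebastian's ternary rewrite and the multiply-by-bool idea), here is a minimal sketch; the function names are invented for illustration, and with optimization both forms typically compile to the same shift as the original code:

#include <stdint.h>
#include <stdbool.h>

uint16_t via_ternary(bool type)  { return type ? 0x0400 : 0x0000; }    /* optimizers usually turn this back into a shift */
uint16_t via_multiply(bool type) { return (uint16_t)(type * 0x0400); } /* relies on bool being exactly 0 or 1, which C and C++ guarantee */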

2 Answers


It depends on the CPU and the compiler. You can experiment with Compiler Explorer. On x86-64, it's a single instruction:

sal     eax, 10

https://gcc.godbolt.org/z/bbx45od8W
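
If you would rather inspect the output locally than in Compiler Explorer, most compilers can dump the generated assembly with -S. A minimal sketch, assuming the arm-none-eabi-gcc cross toolchain (the file name and the -mcpu value are only examples):

/* shift.c */
#include <stdint.h>
#include <stdbool.h>

uint16_t make_value(bool type)
{
    return (uint16_t)((uint16_t)type << 10);  /* the shift from the question */
}

Something like arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -S shift.c -o shift.s then produces a readable shift.s, and adding -fverbose-asm annotates the output with the source expressions each instruction came from.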

Olsonist
  • Fun fact: GCC and clang compile to this even if you write it as `unsigned short value = type ? 0x0400 : 0x0000;` or an equivalent `if(type)`. It's probably a good thing to write it as a shift in the source, though. https://godbolt.org/z/9qMfGW3b8 ARM64 GCC compiles all 3 ways to `ubfiz w0, w0, 10, 6`, although ARM64 clang uses tst/csel for all 3 ways :/ – Peter Cordes Feb 22 '22 at 22:36

I hope this function is a static inline one or even a macro. Otherwise, the function call overhead will needlessly devastate the performance beyond rescue.
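
A minimal sketch of what that could look like, assuming the helper lives in a header shared by the C and C++ code (the name and signature are invented for illustration):

#include <stdint.h>
#include <stdbool.h>

/* Inlined at every call site, so the hot path pays no call/return overhead. */
static inline uint16_t encode_flag(bool type)
{
    return (uint16_t)((uint32_t)type << 10);
}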

And let's focus on ARM, since you put the arm tag:
Remove the typecast if you are 100% sure type fits into 16 bits: typecasting is usually 'free of charge', but depending on the context and toolchain, the compiler could emit masking instructions such as uxth prior to the bit operations, eating up cycles for nothing.
Frankly, you shouldn't write a separate function for simple barrel-shifter operations in the first place: simply embed them in your code instead.
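
For example, when the shifted value is immediately combined with something else, classic ARM and Thumb-2 can fold the shift into the ALU instruction itself through the flexible second operand. An illustrative sketch (not the question's code):

#include <stdint.h>
#include <stdbool.h>

uint32_t pack(uint32_t reg, bool type)
{
    /* often a single 'orr r0, r0, r1, lsl #10'; a Thumb-1-only core such as Cortex-M0 needs a separate shift */
    return reg | ((uint32_t)type << 10);
}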

Barrel shifters are DEFINITELY not a concern on ARM, and they are never slower than any other ALU instructions.

You should worry about other things many people aren't even aware of: the bitness of local variables.
I fear the variable value is of type uint16_t as well.
Does it HAVE TO BE a 16-bit variable?

Basically, ALL local variables SHOULD BE 32-bit ones, preferably unsigned.
ARM ALWAYS does its ALU operations in full 32 bits - there is ZERO gain in using narrower local variables (unlike on the Motorola 680x0). Quite the contrary: using narrower local variables will force the compiler to emit masking operations for nothing, because ARM cannot do it any other way.

You SHOULD change all the local variables to 32-bit ones at all costs, unless you need the modulo effect (255 + 1 = 0, not 256). It WILL definitely make a clear difference.
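
A minimal sketch of the difference (illustrative code, not taken from the question's library):

#include <stdint.h>

uint16_t sum16(const uint16_t *p, uint16_t n)
{
    uint16_t acc = 0;  /* 16-bit local: the compiler may have to re-truncate acc (uxth, or lsl+lsr) inside the loop */
    for (uint16_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}

uint32_t sum32(const uint16_t *p, uint32_t n)
{
    uint32_t acc = 0;  /* 32-bit local: plain 32-bit adds, no masking required */
    for (uint32_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}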

The only exception is int16_t for multiplications: ARMv5E or later has DSP-like instructions such as smulxy and smlaxy that do signed 16-bit multiplies faster than the 32-bit mul and mla instructions.
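
For completeness, a minimal sketch of the kind of expression that can map to those DSP multiplies; whether it actually becomes an smulbb depends on the compiler, the optimization level, and the -mcpu/-march setting:

#include <stdint.h>

int32_t mul16(int16_t a, int16_t b)
{
    return (int32_t)a * b;  /* candidate for smulbb on ARMv5E or later */
}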

Jake 'Alquimista' LEE
  • When would unsigned really be of performance advantage vs. signed? As signed overflow is UB in C and C++, most operations could just use the underlying unsigned 32 bit hardware for signed operations. So int8_t, int16_t and int32_t are not so bad. Really bad are uint8_t and uint16_t (also used in the question). Would unsigned 32-bit multiply be faster than signed 32-bit multiply? – Sebastian Feb 23 '22 at 07:07
  • @Sebastian `SMULL` takes 1 more cycle than `UMULL`, for example. (There might be more.) The division routines require multiple extra instructions for the sign matching (including an `xor`). And if I remember correctly, `asr` takes longer than `lsr` in some ALU instructions. Even if the cycles are the same, signed arithmetic could consume more power. – Jake 'Alquimista' LEE Feb 23 '22 at 07:38
  • @Sebastian Even on `aarch64`, typecasting `int32_t` to `int64_t` requires a sign extension instruction while the cpu has to do nothing for unsigned (it simply takes `x0` instead of `w0`) – Jake 'Alquimista' LEE Feb 23 '22 at 08:05
  • @Sebastian 16bit local variables are much worse than 8bit ones on ARMv5 or earlier. The compiler has to put `lsl r0, r0, #16; lsr r0, r0, #16` after pretty much every step while a `bic r0, r0, #0xff<<8` will do for 8bit ones. (`uxth` is one of the new instructions on ARMv6) – Jake 'Alquimista' LEE Feb 23 '22 at 08:22
  • Why can't the compiler just use wider variables, if they are local anyway? – Sebastian Feb 23 '22 at 08:49
  • @Sebastian Because the programmer explicitly ordered it not to by declaring the variables narrow. Programmers should be aware of the consequences of their actions. Unfortunately, all those fancy modern programming languages with garbage collection and without unsigned data types are dragging down the level of programmers. (Are you listening, Java?) – Jake 'Alquimista' LEE Feb 23 '22 at 09:22
  • I am all for programmers knowing the consequences of their actions and knowing a lot of the things going on behind the scenes from assembler to the workings of the processor and memory subsystem. Nevertheless C and C++ compilers are not forced to follow the explicit 'order' of the programmer. The *as-if rule* would allow code transformations that do not change the observable behaviour. Extending existing (e.g. in memory) signed narrower to wider types would need additional instructions / cycles, I can understand. But just using them as local variables should not need conversions, would it? – Sebastian Feb 23 '22 at 09:49
  • @Sebastian How is the compiler supposed to know that the programmer doesn't want the modulo effect? For people knowing what they are doing, such a nosy compiler would be a nightmare. – Jake 'Alquimista' LEE Feb 23 '22 at 09:58
  • As far as I understand the modulo effect is only guaranteed to work with unsigned integer; so for bit manipulation one should use unsigned integer, in nearly every other case signed integer. Multiplication/division is a special case, as arguments or the result, which are not fitting have to be converted (in comparison to addition/subtraction, where a reinterpretation usually is enough). – Sebastian Feb 23 '22 at 10:43
  • @Sebastian modulo effect also applies to signed values, guaranteedly. – Jake 'Alquimista' LEE Feb 23 '22 at 11:13
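
A minimal sketch of the aarch64 sign-extension point made earlier in this comment thread (illustrative functions; with a typical compiler at -O2 the signed version needs an extra sxtw, while the unsigned one does not, because any 32-bit operation already zeroes the upper half of the destination register):

#include <stdint.h>

uint64_t widen_unsigned(uint32_t a, uint32_t b) { return (uint64_t)(a + b); }  /* add w0, w0, w1 / ret */
int64_t  widen_signed(int32_t a, int32_t b)     { return (int64_t)(a + b); }   /* add w0, w0, w1 / sxtw x0, w0 / ret */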