unsigned int fun1 ( unsigned int a, unsigned int b )
{
return(a/b);
}
unsigned int fun2 ( unsigned int a )
{
return(a/2);
}
unsigned int fun3 ( unsigned int a )
{
return(a/3);
}
unsigned int fun10 ( unsigned int a )
{
return(a/10);
}
unsigned int fun13 ( void )
{
return(10/13);
}
and just try it:
00000000 <fun1>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <__aeabi_uidiv>
8: e8bd4010 pop {r4, lr}
c: e12fff1e bx lr
00000010 <fun2>:
10: e1a000a0 lsr r0, r0, #1
14: e12fff1e bx lr
00000018 <fun3>:
18: e59f3008 ldr r3, [pc, #8] ; 28 <fun3+0x10>
1c: e0802093 umull r2, r0, r3, r0
20: e1a000a0 lsr r0, r0, #1
24: e12fff1e bx lr
28: aaaaaaab bge feaaaadc <fun13+0xfeaaaa9c>
0000002c <fun10>:
2c: e59f3008 ldr r3, [pc, #8] ; 3c <fun10+0x10>
30: e0802093 umull r2, r0, r3, r0
34: e1a001a0 lsr r0, r0, #3
38: e12fff1e bx lr
3c: cccccccd stclgt 12, cr12, [r12], {205} ; 0xcd
00000040 <fun13>:
40: e3a00000 mov r0, #0
44: e12fff1e bx lr
As one would expect, if the compiler cannot deal with it at compile time, it calls the appropriate library function, which is the root of the performance issue. If you don't have a native divide instruction you end up with many instructions executed, plus all of their fetches; 10 to 100 times slower sounds about right.
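To get a feel for why that costs so much: this is not the actual __aeabi_uidiv from libgcc, just a rough sketch of the kind of shift-and-subtract loop a software divide routine has to run, roughly 32 iterations of shifts, compares and branches (plus the call overhead), versus a single instruction if you have hardware divide.

unsigned int soft_udiv ( unsigned int num, unsigned int den )
{
    /* classic restoring division: one quotient bit per iteration */
    unsigned int quot = 0;
    unsigned int rem = 0;
    int i;

    for(i = 31; i >= 0; i--)
    {
        rem = (rem << 1) | ((num >> i) & 1);
        if(rem >= den)
        {
            rem -= den;
            quot |= 1u << i;
        }
    }
    return(quot);
}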
Interesting that the compiler does use the multiply-by-inverse trick for 1/3 and 1/10 here, and that when the result can be computed at compile time it just returns the fixed result.
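For reference, this is roughly what those two literal-pool constants are doing, written back out in C. The 64-bit product is the umull, and the shift counts (32 for the high word plus the extra lsr) match the disassembly above; the function names here are made up.

#include <stdint.h>

unsigned int div3_by_mul ( unsigned int a )
{
    /* 0xAAAAAAAB is (2^33+1)/3; take the high word of the 64-bit
       product (umull) then lsr #1, a total shift of 33 */
    return (unsigned int)(((uint64_t)a * 0xAAAAAAABu) >> 33);
}

unsigned int div10_by_mul ( unsigned int a )
{
    /* 0xCCCCCCCD is roughly 2^35/10; high word (shift 32) plus lsr #3 */
    return (unsigned int)(((uint64_t)a * 0xCCCCCCCDu) >> 35);
}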
Compiler authors can read the same Hacker's Delight and Stack Overflow pages we can, know the same tricks and, if willing and interested, can implement those optimizations. Don't assume they always will; just because I have some version of some compiler that finds these doesn't mean all compilers can or will.
As far as whether you should let the compiler/toolchain do it for you: that's up to you. Even if you have a divide instruction, if you target multiple platforms you may still choose to shift right instead of dividing by 2, or to apply other of these tricks by hand. If you own the divide, you at least know what it is doing; if you hand it over to the compiler, you have to disassemble regularly to see what it is doing (if you care). If this is in a timing-critical section you may wish to do both: see what the compiler does, then steal that answer or create your own deterministic solution (leaving it up to the compiler is not necessarily deterministic, and I think that is the point).
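If you do decide to own it, a minimal sketch of what "stealing that answer" can look like; the names are made up and the constant is lifted straight from the compiler output above.

static inline unsigned int half ( unsigned int a )
{
    return(a >> 1);   /* divide by 2 as an explicit shift */
}

static inline unsigned int tenth ( unsigned int a )
{
    /* same multiply/shift the compiler generated for fun10 */
    return (unsigned int)(((unsigned long long)a * 0xCCCCCCCDu) >> 35);
}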
EDIT
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-objdump -D so.o
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I have a gcc 4.8.3 here that also produces those optimizations, as well as a 5.4.0, so they have been doing it for a while.
The ARM UMULL instruction is a 64-bit = 32-bit * 32-bit operation, so the multiply itself can't overflow. It certainly works for 1/3 and 1/10; I'm not sure how large an N you can go for 1/N and still have every 32-bit operand come out right in 64 bits. Performing a simple experiment shows that, at least for these two cases, all possible 32-bit patterns work (that is, for unsigned).
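That simple experiment is just a brute-force loop over every 32-bit input, comparing the multiply/shift result against a real division; something along these lines (it takes a while, but it does finish):

#include <stdint.h>
#include <stdio.h>

int main ( void )
{
    uint32_t a = 0;

    do
    {
        if((uint32_t)(((uint64_t)a * 0xAAAAAAABu) >> 33) != a / 3)
        {
            printf("1/3 fails at 0x%08X\n", (unsigned int)a);
            return 1;
        }
        if((uint32_t)(((uint64_t)a * 0xCCCCCCCDu) >> 35) != a / 10)
        {
            printf("1/10 fails at 0x%08X\n", (unsigned int)a);
            return 1;
        }
        a++;
    } while(a != 0);

    printf("all 32 bit patterns pass\n");
    return 0;
}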
It appears to use the trick for signed as well:
int negfun ( int a )
{
return(a/3);
}
00000000 <negfun>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <negfun+0x10>
4: e0c32390 smull r2, r3, r0, r3
8: e0430fc0 sub r0, r3, r0, asr #31
c: e12fff1e bx lr
10: 55555556 ldrbpl r5, [r5, #-1366] ; 0xfffffaaa
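Written back out in C, that smull/sub sequence is doing something like this; note it relies on arithmetic right shift of negative values, which gcc provides but the C standard leaves implementation-defined, and the function name is made up.

#include <stdint.h>

int negdiv3_by_mul ( int a )
{
    /* 0x55555556 is roughly 2^32/3; smull gives the signed high word */
    int hi = (int)(((int64_t)a * 0x55555556LL) >> 32);
    /* sub r0, r3, r0, asr #31: adds 1 for negative a to round toward zero */
    return(hi - (a >> 31));
}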