I am experimenting with the code from https://github.com/torvalds/linux/blob/master/lib/math/div64.c on https://godbolt.org/, targeting an ARM Cortex-M7 microcontroller.
clang generates much more code (208 lines of assembly) than gcc (69 lines).
I am using -O3, so I expect the compiler to unroll, expand and so on, but I am not sure that is a good thing in this case.
I don't know whether this helps, but:
1) removing either the first or the second loop makes clang's output similar to what gcc produces;
2) clang 5.0.0 generates much smaller code.
Here is the code (a link to the configured example follows below):
unsigned int div64_32(unsigned long long * n, unsigned int base)
{
    unsigned long long rem = *n;
    unsigned long long b = base;
    unsigned long long res, d = 1;
    unsigned int high = rem >> 32;

    /* Reduce the thing a bit first */
    res = 0;
    if (high >= base) {
        high /= base;
        res = (unsigned long long) high << 32;
        rem -= (unsigned long long) (high*base) << 32;
    }

    while ((long long)b > 0 && b < rem) {
        b = b+b;
        d = d+d;
    }

    do {
        if (rem >= b) {
            rem -= b;
            res += d;
        }
        b >>= 1;
        d >>= 1;
    } while (d);

    *n = res;
    return rem;
}
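Before comparing code size or timings, it may be worth confirming on a host machine that the routine agrees with native 64-bit division. The following is only a minimal sketch: `div64_32` is copied from the question with whitespace changes only, while `matches_native` and `count_mismatches` are hypothetical helper names introduced here. The input sequence replays the one used in the test loop from the update further down, and a nonzero divisor is assumed throughout.

```c
#include <assert.h>

/* Copied from the question; only whitespace differs. */
unsigned int div64_32(unsigned long long *n, unsigned int base)
{
    unsigned long long rem = *n;
    unsigned long long b = base;
    unsigned long long res, d = 1;
    unsigned int high = rem >> 32;

    /* Reduce the thing a bit first */
    res = 0;
    if (high >= base) {
        high /= base;
        res = (unsigned long long) high << 32;
        rem -= (unsigned long long) (high*base) << 32;
    }

    while ((long long)b > 0 && b < rem) {
        b = b+b;
        d = d+d;
    }

    do {
        if (rem >= b) {
            rem -= b;
            res += d;
        }
        b >>= 1;
        d >>= 1;
    } while (d);

    *n = res;
    return rem;
}

/* Hypothetical helper: returns 1 when both the quotient and the
 * remainder match the native / and % operators. */
int matches_native(unsigned long long n, unsigned int base)
{
    assert(base != 0);  /* division by zero is not handled */
    unsigned long long q = n;
    unsigned int r = div64_32(&q, base);
    return q == n / base && r == n % base;
}

/* Hypothetical helper: replays the benchmark's input sequence
 * (the n *= 127 variant) and counts disagreements with native division. */
int count_mismatches(void)
{
    unsigned long long n = 54321, base = 123;
    int bad = 0;

    for (int i = 0; i < 10000; i++) {
        unsigned int divisor = (unsigned int)base;
        if (!matches_native(n, divisor))
            bad++;
        base = div64_32(&n, divisor);  /* remainder feeds the next divisor */
        n += 65432;
        n *= 127;                      /* unsigned wraparound is well defined */
        base += 123;
    }
    return bad;
}
```

Calling `count_mismatches()` from a host-side `main` and checking that it returns 0 gives some confidence that any compiler-specific build being benchmarked is at least computing the same thing as the C source.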
Options for both compilers:
ARM GCC 7.2.1 (none): -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=vfpv4
x86-64 clang (trunk): -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=vfpv4 -target armv7m-none-eabi
And finally the link to example: link
Update
I ran tests on an NXP Kinetis KV58 running at 240 MHz, with all code relocated to I-TCM.
Results, measured with an oscilloscope:
         run 1      run 2
iar:     27.2 ms    17.6 ms
clang:   29.6 ms    19.2 ms
gcc:     22.1 ms    14.0 ms
Run 1 used n *= 123 and run 2 used n *= 127 (see the code below).
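To put the pulse widths in perspective, they can be converted into average core cycles per loop iteration. This is a rough sketch that assumes each measured pulse spans exactly the 10000 iterations of one test loop, so the figure includes the loop overhead (the add, the multiply and the call) and not only the division. `cycles_per_iteration` is a hypothetical helper name.

```c
#include <assert.h>

/* Hypothetical helper: average core cycles spent per loop iteration,
 * given the pulse width seen on the oscilloscope. */
double cycles_per_iteration(double pulse_ms, double core_mhz, int iterations)
{
    assert(iterations > 0);
    /* ms -> us is a factor of 1000; us multiplied by MHz gives cycles. */
    return pulse_ms * 1000.0 * core_mhz / iterations;
}
```

With the run 1 numbers at 240 MHz this works out to about 530 cycles per iteration for gcc, 653 for iar and 710 for clang, so the clang build takes roughly 1.34 times as long as the gcc build in run 1 and roughly 1.37 times as long in run 2.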
I didn't compile the code with clang and gcc myself; I just copied and pasted the assembly from the website. I also had to change the GCC output, because my compiler did not recognize two of the instructions (see also the comments on the question):
ldrd r0, [r0] (line 9) => ldrd r0, r1, [r0]
strd r6, [ip] (line 55) => strd r6, r7, [r12]
void test()
{
    debug_pin(0, 0); // test pin 0: set LOW
    for (int i = 0; i < 1000; i++);
    debug_pin(0, 1); // test pin 0: set HIGH
    unsigned long long n = 54321, base = 123;
    for (int i = 0; i < 10000; i++)
    {
        base = div64_32_iar(&n, base);
        n += 65432;
        n *= 127; // 1st test was with 123
        base += 123;
    }

    debug_pin(0, 0);
    for (int i = 0; i < 1000; i++);
    debug_pin(0, 1);
    n = 54321, base = 123;
    for (int i = 0; i < 10000; i++)
    {
        base = div64_32_clang(&n, base);
        n += 65432;
        n *= 127; // 1st test was with 123
        base += 123;
    }

    debug_pin(0, 0);
    for (int i = 0; i < 1000; i++);
    debug_pin(0, 1);
    n = 54321, base = 123;
    for (int i = 0; i < 10000; i++)
    {
        base = div64_32_gcc(&n, base);
        n += 65432;
        n *= 127; // 1st test was with 123
        base += 123;
    }
    debug_pin(0, 0);
}