There is a well-known trick for performing division by an invariant integer without doing any division at all, using multiplication instead. This has been discussed on Stack Overflow in Perform integer division using multiplication and in Why does GCC use multiplication by a strange number in implementing integer division?
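To make the trick concrete, here is a minimal sketch for a 32-bit dividend and the fixed divisor 1000. The constant and shift are one valid choice (ceil(2^38 / 1000) with a shift of 38); a compiler derives such a pair automatically for any invariant divisor:

#include <stdint.h>

/* Sketch: x / 1000 as a multiply and a shift. MAGIC = ceil(2^38 / 1000)
 * = 274877907. The 64-bit product cannot overflow for a 32-bit x, and
 * (x * MAGIC) >> 38 equals x / 1000 for every uint32_t value. */
static uint32_t div1000(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 274877907u) >> 38);
}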
However, I recently tested the following code both on AMD64 and on ARM (Raspberry Pi 3 model B):
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    volatile uint64_t x = 123456789;  /* volatile keeps the division in the loop */
    volatile uint64_t y = 0;
    struct timeval tv1, tv2;
    int i;
    gettimeofday(&tv1, NULL);
    for (i = 0; i < 1000*1000*1000; i++)
    {
        y = (x + 999) / 1000;  /* round-up division by the invariant 1000 */
    }
    gettimeofday(&tv2, NULL);
    /* 1e9 iterations / elapsed seconds / 1e6 = millions of operations per second */
    printf("%g MPPS\n", 1e3 / ( tv2.tv_sec - tv1.tv_sec +
                                (tv2.tv_usec - tv1.tv_usec) / 1e6));
    return 0;
}
The code is horribly slow on ARM, whereas on AMD64 it's extremely fast. I noticed that on ARM the generated code calls __aeabi_uldivmod, while on AMD64 it does not divide at all and instead emits the following:
.L2:
	movq	(%rsp), %rdx        # load x
	addq	$999, %rdx          # x + 999
	shrq	$3, %rdx            # divide by 8 (1000 = 8 * 125)
	movq	%rdx, %rax
	mulq	%rsi                # 128-bit multiply by a magic constant held in %rsi
	shrq	$4, %rdx            # high 64 bits of the product, shifted: completes the / 125
	subl	$1, %ecx            # decrement the loop counter
	movq	%rdx, 8(%rsp)       # store y
	jne	.L2
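In C terms, the loop body computes something like the following. This is my reconstruction: the listing does not show the value loaded into %rsi, so the magic constant below (ceil(2^68 / 125)) is an assumption, and unsigned __int128 is a GCC/Clang extension standing in for the 128-bit mulq result:

#include <stdint.h>

/* Reconstruction of the AMD64 loop body: (x + 999) / 1000 with no divide.
 * MAGIC is assumed to be ceil(2^68 / 125) = 2361183241434822607, matching
 * the shrq $3 / mulq / shrq $4 sequence above. */
static uint64_t div1000_round_up(uint64_t x)
{
    uint64_t t = (x + 999) >> 3;                        /* addq $999; shrq $3 */
    const uint64_t MAGIC = 2361183241434822607u;        /* assumed value of %rsi */
    unsigned __int128 p = (unsigned __int128)t * MAGIC; /* mulq %rsi */
    return (uint64_t)(p >> 64) >> 4;                    /* high half, then shrq $4 */
}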
The question is: why? Is there some feature of the ARM architecture that makes this optimization infeasible there? Or is it simply that ARM is less common, so optimizations like this haven't been implemented?
Before people start making suggestions in the comments: I tried both gcc and clang, and both the -O2 and -O3 optimization levels.
On my AMD64 laptop, the benchmark reports 1181.35 MPPS, whereas on the Raspberry Pi it reports 5.50628 MPPS. That is a difference of more than two orders of magnitude!