
There is a well-known trick to perform division by an invariant integer without doing any division at all, using multiplication instead. This has been discussed on Stack Overflow in Perform integer division using multiplication and in Why does GCC use multiplication by a strange number in implementing integer division?

However, I recently tested the following code both on AMD64 and on ARM (Raspberry Pi 3 model B):

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char **argv)
{
  volatile uint64_t x = 123456789; /* volatile: force a real load every iteration */
  volatile uint64_t y = 0;         /* volatile: force a real store every iteration */
  struct timeval tv1, tv2;
  int i;
  gettimeofday(&tv1, NULL);
  for (i = 0; i < 1000*1000*1000; i++)
  {
    y = (x + 999) / 1000;          /* round-up division by the constant 1000 */
  }
  gettimeofday(&tv2, NULL);
  printf("%g MPPS\n", 1e3 / ( tv2.tv_sec - tv1.tv_sec +
                             (tv2.tv_usec - tv1.tv_usec) / 1e6));
  return 0;
}

The code is horribly slow on ARM architectures. In contrast, on AMD64 it's extremely fast. I noticed that on ARM, it calls __aeabi_uldivmod, whereas on AMD64 it actually does not divide at all but instead does the following:

.L2:
        movq    (%rsp), %rdx
        addq    $999, %rdx
        shrq    $3, %rdx
        movq    %rdx, %rax
        mulq    %rsi
        shrq    $4, %rdx
        subl    $1, %ecx
        movq    %rdx, 8(%rsp)
        jne     .L2

The question is, why? Is there some particular feature on the ARM architecture that makes this optimization infeasible? Or is it just due to the rareness of the ARM architecture that optimizations like this haven't been implemented?

Before people start making suggestions in the comments: I tried both gcc and clang, and also tried the -O2 and -O3 optimization levels.

On my AMD64 laptop, it gives 1181.35 MPPS, whereas on the Raspberry Pi it gives 5.50628 MPPS. That is a difference of more than two orders of magnitude!

Peter Cordes
juhist
  • what did you do to eliminate test overhead and check against alignment effects and other factors? – old_timer Dec 27 '17 at 15:49
  • @old_timer Nothing. Alignment shouldn't be an issue in this case. But I actually noticed that 32-bit integer division IS optimized on ARM, so perhaps it's 32 bits vs 64 bits issue... – juhist Dec 27 '17 at 15:50
  • _"is it just due to the rareness of the ARM architecture"_. Certainly not rareness, considering that pretty much every cell phone manufactured in the last 10-15 years is using some form of ARM CPU. – Michael Dec 27 '17 at 15:51
  • alignment makes a huge difference certainly on arm, but in general. your test measurement is flawed in general so you have to account for that. You need to have accurate data before you can start asking these kinds of questions, so far there is nothing here to indicate the multiplication or division is remotely related to your results. – old_timer Dec 27 '17 at 15:58
  • Not an expert on ARM, does it have 64x64->128 multiplication? – Margaret Bloom Dec 27 '17 at 15:59
  • alignment isn't as much of a factor on x86 due to the massive overhead. – old_timer Dec 27 '17 at 16:02
  • are you using aarch32 or aarch64? – old_timer Dec 27 '17 at 16:02
  • It's armv7l, i.e. aarch32. The lack of 64x64->128 multiplication could explain why it isn't optimized. – juhist Dec 27 '17 at 16:04

1 Answer


gcc only uses a multiplicative inverse for division with operand types of register width or narrower. You're testing x86-64 against ARM32, so uint64_t gives x86-64 a huge advantage in this case.

Extended-precision multiply would probably be worth it on 32-bit CPUs with high-throughput multiply, like modern x86, and maybe also the Raspberry Pi 3's Cortex-A53 (running in 32-bit mode) if its multiplier is much better pipelined than its divider.

It would take multiple multiply instructions to build the high half of a 64x64 => 128-bit product out of 32x32 => 64-bit multiplies, which ARM32 does have (umull).

However, this is not what gcc or clang choose to do, at any optimization level.

If you want to handicap your x86 CPU, compile 32-bit code using -m32. x86 gcc -O3 -m32 will use __udivdi3. I wouldn't call this "fair", though, because a 64-bit CPU is much faster at 64-bit arithmetic, and a 32-bit ARM userland has no 64-bit mode available to fall back on.

OTOH, a 32-bit-only x86 CPU couldn't be much faster than current x86 CPUs in 32-bit mode; the main cost of the extra transistors that go unused in 32-bit mode is die area and power, not so much top-end clock speed. Some low-power-budget CPUs (like ULV laptop chips) might sustain max turbo for longer if they were designed without support for long mode (x86-64), but that's pretty minor.

So it might be interesting to benchmark 32-bit x86 vs. 32-bit ARM, just to learn something about the microarchitectures maybe. But if you care about 64-bit integer performance, definitely compile for x86-64 not x86-32.

Peter Cordes