
Apparently it's true that on ARM CPUs, division is 10-100x slower than bit shifts. On this site it is stated that this can be solved in a number of ways. One of them is look-up tables for small problems, which is fine and standard. But also listed was replacing division with multiplication by a fixed-point reciprocal followed by a bit shift (so that x/3 becomes (x*6) << 1, etc.). Another was replacing (x % y) > z with x > (z * y).
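
For concreteness, here is my own illustration of what I understand the reciprocal trick to be (the constant is ceil(2^17/3), so this only covers 16-bit inputs):

#include <stdint.h>

/* Illustration only: approximate 1/3 as 0xAAAB / 2^17. The constant is
   ceil(2^17/3), which gives the exact quotient for every 16-bit unsigned x
   without using a divide. */
uint16_t div3_16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 0xAAABu) >> 17);
}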

I'm far from an expert, but this sounds really odd to me. I mean, if you're using a modern compiler, wouldn't this be exactly the kind of thing that is optimized for you?

Benjamin Lindqvist
  • Optimization is implementation and addressing (32-bit/64-bit) specific. uProcessor, OS, compiler, etc. determine how or even if something is optimized. Is there a specific problem you are trying to address? – ryyker Feb 28 '17 at 19:58
  • That technique only works for numbers in a given range, or for specific divisors, but the compiler does not know what the input will be. – Weather Vane Feb 28 '17 at 19:59
  • @WeatherVane Aah, are you saying that because x*6 might be larger than what the data type allows for, the compiler can't guarantee that x*6 >> 1 = x/3? – Benjamin Lindqvist Feb 28 '17 at 20:03
  • No, I am saying the compiler does not know you will be dividing by 3 at run time. Your example efficiency is for a divisor which is known at compile time: a constant. But you can't expect the compiler to generate code which will detect various divisors at run time, to optimise the algorithm. – Weather Vane Feb 28 '17 at 20:08
  • But if you don't know the denominator at compile time, you're in the same position as the compiler, i.e. this technique is unavailable to the developer as well, right? But it seems to me that even if the denominator is constant, the compiler wouldn't be able to optimize anyway because the multiplication might cause integer overflow... – Benjamin Lindqvist Feb 28 '17 at 20:12
  • Have you just answered your own question? – Weather Vane Feb 28 '17 at 20:16
  • Yes I think I get it now :P – Benjamin Lindqvist Feb 28 '17 at 20:17
  • How about having a look at the code a modern compiler like gcc generates with and without optimisations instead of asking a too broad question? IOW: What have you found out by your own research? Did you do any? Why not? – too honest for this site Feb 28 '17 at 20:19
  • Lol, the question was answered perfectly succinctly and objectively. Looking at gcc code would reveal that the optimization technique was NOT utilized, but it would not reveal WHY. A question can be valid despite not amounting to "I can't find the bug in my code, please help me" – Benjamin Lindqvist Feb 28 '17 at 20:24
  • Of course the compiler did not report its decision making process. Did you expect it to say "I considered 1000 other algorithms, each listed with the reasons why I decided not to use. I know another 2000 algorithms (listed) which I did not think worth considering"? – Weather Vane Feb 28 '17 at 20:35
  • @WeatherVane Sorry, I can't make out who your last comment is addressed to. – Benjamin Lindqvist Feb 28 '17 at 20:42
  • You mentioned that the compiler does not tell you WHY it does something. I would have directed it to Olaf if meant for him, knowing that the OP gets flagged anyway. – Weather Vane Feb 28 '17 at 20:49
  • Ok, I was confused because it just makes absolutely zero sense if addressed to me. Just none. – Benjamin Lindqvist Feb 28 '17 at 20:55
  • You said the compiler "would not reveal WHY" it generated the code it did, and I attempted to show how that expectation is absurd. – Weather Vane Feb 28 '17 at 20:56
  • I was nowhere stating that that was my expectation. – Benjamin Lindqvist Feb 28 '17 at 20:59

2 Answers

unsigned int fun1 ( unsigned int a, unsigned int b )
{
    return(a/b);
}
unsigned int fun2 ( unsigned int a )
{
    return(a/2);
}
unsigned int fun3 ( unsigned int a )
{
    return(a/3);
}
unsigned int fun10 ( unsigned int a )
{
    return(a/10);
}
unsigned int fun13 ( void )
{
    return(10/13);
}

and just try it.

00000000 <fun1>:
   0:   e92d4010    push    {r4, lr}
   4:   ebfffffe    bl  0 <__aeabi_uidiv>
   8:   e8bd4010    pop {r4, lr}
   c:   e12fff1e    bx  lr

00000010 <fun2>:
  10:   e1a000a0    lsr r0, r0, #1
  14:   e12fff1e    bx  lr

00000018 <fun3>:
  18:   e59f3008    ldr r3, [pc, #8]    ; 28 <fun3+0x10>
  1c:   e0802093    umull   r2, r0, r3, r0
  20:   e1a000a0    lsr r0, r0, #1
  24:   e12fff1e    bx  lr
  28:   aaaaaaab    bge feaaaadc <fun13+0xfeaaaa9c>

0000002c <fun10>:
  2c:   e59f3008    ldr r3, [pc, #8]    ; 3c <fun10+0x10>
  30:   e0802093    umull   r2, r0, r3, r0
  34:   e1a001a0    lsr r0, r0, #3
  38:   e12fff1e    bx  lr
  3c:   cccccccd    stclgt  12, cr12, [r12], {205}  ; 0xcd

00000040 <fun13>:
  40:   e3a00000    mov r0, #0
  44:   e12fff1e    bx  lr

As one would expect, if the compiler can't deal with it at compile time, it calls the appropriate library function, which is the root of the performance issue. If you don't have a native divide instruction, you end up with many instructions executed, plus all of their fetches. 10 to 100 times slower sounds about right.

Interestingly, they do use the 1/3 and 1/10 trick here, and if the result can be computed at compile time, they just return the fixed result.
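
In C terms, those two sequences amount to roughly this (my reading of the disassembly above; the magic constants are ceil(2^33/3) and ceil(2^35/10)):

#include <stdint.h>

/* Sketch of what the umull + lsr pairs compute: a 64-bit product,
   keeping the high bits, i.e. multiply by a fixed-point reciprocal. */
uint32_t div3_u(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);  /* umull, then lsr #1 of the high word */
}

uint32_t div10_u(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);  /* umull, then lsr #3 of the high word */
}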

Compiler authors can read the same Hacker's Delight and Stack Overflow pages we can, and they know the same tricks; if willing and interested, they can implement those optimizations. Don't assume they always will: just because I have some version of some compiler that finds these doesn't mean all compilers can/will.

As for whether you should let the compiler/toolchain do it for you: that's up to you. Even if you have a divide instruction, if you target multiple platforms you may choose to shift right instead of dividing by 2, or to use others of these tricks. If you own the divide, you at least know what it is doing; if you hand it over to the compiler, you have to regularly disassemble the output to understand what it is doing (if you care). If this is in a timing-critical section, you may wish to do both: see what the compiler does, then steal that answer or create your own deterministic solution (leaving it up to the compiler is not necessarily deterministic, and I think that is the point).

EDIT

arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-objdump -D so.o

arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I have a gcc 4.8.3 here that also produced those optimizations...as well as a 5.4.0, so they have been doing it for a while.

The ARM UMULL instruction is a 64-bit = 32-bit * 32-bit operation, so the multiply itself can't overflow. It certainly works for 1/3 and 1/10; I'm not sure how large a value of N you can go for 1/N in 64 bits and still have every 32-bit operand work. A simple experiment shows that, at least for these two cases, all possible 32-bit patterns work (that is, for unsigned).
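
A minimal version of that experiment might look like this (my own sketch, reusing the constants and shifts from the generated code above):

#include <stdint.h>
#include <stdio.h>

/* Brute-force check: compare the multiply-and-shift forms against real
   division for every 32-bit unsigned input. Slow but exhaustive. */
int main(void)
{
    uint32_t x = 0;
    do {
        if ((uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33) != x / 3 ||
            (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35) != x / 10) {
            printf("mismatch at %u\n", (unsigned)x);
            return 1;
        }
    } while (x++ != 0xFFFFFFFFu);
    printf("all 2^32 inputs match for /3 and /10\n");
    return 0;
}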

It appears to use the trick for signed as well:

int negfun ( int a )
{
    return(a/3);
}
00000000 <negfun>:
   0:   e59f3008    ldr r3, [pc, #8]    ; 10 <negfun+0x10>
   4:   e0c32390    smull   r2, r3, r0, r3
   8:   e0430fc0    sub r0, r3, r0, asr #31
   c:   e12fff1e    bx  lr
  10:   55555556    ldrbpl  r5, [r5, #-1366]    ; 0xfffffaaa
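
Read back into C, that is roughly the following (my interpretation of the disassembly; it assumes arithmetic right shifts of negative values, which is what this compiler generates for):

#include <stdint.h>

/* Sketch of the signed a/3 sequence: high half of a 64-bit product
   (smull), then subtract the sign bit to round toward zero. */
int div3_signed(int a)
{
    int q = (int)(((int64_t)a * 0x55555556LL) >> 32);  /* smull, keep the high word */
    return q - (a >> 31);                              /* sub r0, r3, r0, asr #31 */
}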
old_timer
  • Note ARM does have some cores with divide, and there are many other processors that don't have divide, so this really isn't/shouldn't be an ARM question. – old_timer Feb 28 '17 at 21:00
  • Wow, that's really interesting! Now I'm a bit ashamed I didn't do the same due diligence. Do you mind sharing compiler version and flags? – Benjamin Lindqvist Feb 28 '17 at 21:10
  • http://homepage.cs.uiowa.edu/~jones/bcd/divide.html I think you need more bits for the 1/10 and 1/3 thing to work. But it is pretty easy to do an experiment and see... – old_timer Feb 28 '17 at 21:22
  • Good point, so I tested all 2^32 inputs for the /3 and /10 cases and I got no failures – harold Feb 28 '17 at 21:28
  • I obviously trust it, but I don't understand it :D Why doesn't this overflow if you use huge inputs? – Benjamin Lindqvist Feb 28 '17 at 21:30
  • Yeah, okay, I see now that it is a 64-bit result and it is taking the bits from 33 up, so yes that should work; I ran an experiment anyway to confirm all 32-bit patterns work for divide by 3 at least. – old_timer Feb 28 '17 at 21:32
  • umull is a 64 bit = 32 bit * 32 bit multiply, there is no overflow on the multiply. Then they take the upper 32 bits of the result and shift that down by one to compensate for the binary point location. – old_timer Feb 28 '17 at 21:33
  • if your instruction set only has a 32 bit = 32 bit * 32 bit multiply then you cannot do this 1/N trick for all possible inputs; some will work but many won't. Because ARM has this 64 = 32 * 32 it works quite nicely, enough bits of precision. Now for the 1/N thing, will it work for all 32 bit values of N? – old_timer Feb 28 '17 at 21:39
  • Interesting question, can we always find a multiplying factor that is smaller than 2^16 for any (32 bit) N? I'm going to think about if this can be answered analytically. – Benjamin Lindqvist Feb 28 '17 at 22:09

Divide by a constant is often optimized by compilers into a multiply-and-shift sequence, even on processors with a divide instruction. In some cases the sequence is a bit longer, but it still uses only one multiply. Here is a link to a prior thread about this:

Why does GCC use multiplication by a strange number in implementing integer division?
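
When the divisor needs the longer sequence (7, for example, where the full reciprocal constant takes 33 bits), the generated code typically looks something like this sketch (constants from the usual Hacker's Delight style derivation, not copied from any particular compiler's output):

#include <stdint.h>

/* Sketch: unsigned x / 7 with one multiply plus a short fixup. The 33-bit
   reciprocal 0x124924925 doesn't fit in a register, so multiply by its low
   32 bits and repair the result with shifts and an add/subtract. */
uint32_t div7_u(uint32_t x)
{
    uint32_t t = (uint32_t)(((uint64_t)x * 0x24924925u) >> 32);  /* high half of the product */
    return (((x - t) >> 1) + t) >> 2;                            /* fixup: still no divide */
}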

Divide by a variable on a processor without a divide instruction is usually handled by an optimized library function, based on some variation of the methods described in this wiki article:

http://en.wikipedia.org/wiki/Division_algorithm#Fast_division_methods

Using a 32-bit by 32-bit divide as an example, there may be three main paths. For divisors < 256, the divide-by-constant method can be used (a 256-entry table of reciprocal constants). For expected quotients < 256, an unrolled subtract-and-shift sequence may be used (the basic idea is sketched below). The main path does a table lookup to get an initial approximation, then a sequence that includes 4 multiplies plus some adds, subtracts, and shifts to quadruple the number of correct bits from the table value, so that the estimated quotient equals either the actual quotient or the actual quotient - 1. The product of estimated quotient * divisor is then subtracted from the dividend, and if the remainder >= divisor, the quotient is incremented and the divisor subtracted from the dividend again. For a 64-bit by 64-bit divide, the main sequence would involve 6 multiplies, ... to produce the estimated quotient.
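
A sketch of the subtract-and-shift idea mentioned above (just schoolbook binary long division, not the actual library routine):

#include <stdint.h>

/* One quotient bit per iteration: shift the remainder left, bring in the
   next dividend bit, and do a trial subtraction of the divisor. */
uint32_t udiv_shift_subtract(uint32_t n, uint32_t d)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);  /* bring down bit i of the dividend */
        if (r >= d) {                     /* trial subtraction succeeds */
            r -= d;
            q |= 1u << i;
        }
    }
    return q;                             /* q is the quotient; r holds the remainder here */
}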

rcgldr