I am writing some assembly code for ARM8 (aarch64). I want to do a division and use the remainder obtained for further calculations. In x86 when I use 'div', and I know my remainder is kept in RDX. My question is - is there an equivalent to that in the aarch64 instruction set? I know 'udiv' and 'sdiv' do unsigned and signed divisions and gets me the quotient. Is there a single instruction which would give me remainder? (I want % modulo operator in c). I understand I can obtain it using algebra, just wanted to confirm I am not missing out on a simpler method.
2 Answers
Barring constant power-of-two divisors which can be optimised down to an and
, there is no instruction that will calculate the remainder of a division. You can, however do it pretty neatly in two:
// input: x0=dividend, x1=divisor
udiv x2, x0, x1
msub x3, x2, x1, x0
// result: x2=quotient, x3=remainder

- 20,095
- 3
- 40
- 77
-
This is wonderful! I used these 2 instructions to help in adding the digits of a given number for a programming assignment using a required loop. Thanks! – merlot Apr 06 '22 at 01:40
Calculating the remainder isn't a single instruction
The Clang C compiler produced the following code for a modulo calculation:
udiv x10, x0, x9
msub x10, x10, x9, x0
Good news, this isn't slow!
While the x86 does this in a single instruction, that doesn't make it faster.
On the Apple M-1, the above instruction pair is executed in about the same time as a single step. Possibly this is due to instruction macro-fusion that decodes multiple instructions into a single µ-op. It could also be due to parallelism in multiple execution units. Possibly, it is done in one EU where the remainder from the division calculation is cached and returned immediately.
Whatever the implementation, it seems to be about as fast as Intel's single instruction form.
Division-only
Timing:
$ time ./a.out 12345678901
Total: 301123495054
real 0m10.036s
user 0m9.668s
sys 0m0.031s
Instruction generated:
udiv x10, x0, x9
Remainder-only
Timing:
$ time ./a.out 12345678901
Total: 8612082846779832640
real 0m10.190s
user 0m9.768s
sys 0m0.070s
Instructions generated:
udiv x10, x0, x9
msub x10, x10, x9, x0
Division and remainder
Timing:
$ time ./a.out 12345678901
Total: 8612083123211969892
real 0m10.103s
user 0m9.752s
sys 0m0.019s
Instructions generated:
udiv x10, x0, x9
msub x11, x10, x9, x0
Benchmark code
The following C code can be run with either q = n / d
or r = n % d
commented out:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
unsigned long long n, d, q=1, r=1, total=0;
n = strtoull(argv[1], NULL, 10);
total = 0;
for (d=1 ; d<=n ; d++) {
q = n / d;
r = n % d;
total += q + r;
}
printf("Total: %llu", total);
return 0;
}

- 216,523
- 63
- 388
- 485
-
3You haven't proved it's a "single step" (fused into 1 wide instruction like mov/movk); I think you've proved that the throughput bottleneck of `udiv` can hide the extra work of an `msub` just fine, which is expected. I'd guess that a dependency chain through `%` would have higher latency than a dep chain through `/`, i.e. if the next iteration's inputs depended on a `%` this iteration, like `x %= d` if that can be arranged to not optimize away, e.g. by decrementing `d` instead of incrementing so it can't prove that it's a no-op. – Peter Cordes Apr 05 '21 at 23:04
-
@PeterCordes Given that multiple execution units are involved, the timings don't constitute proof. That said, it is likely given that 1) macro-fusion is a well known and proven technique, 2) essentially all reasonable hardware implementations of integer division will simultaneously produce a remainder, 3) it would be wasteful to not combine the steps, and 4) Daniel Lemire also showed that the M-1 fuses *mul* and *umulh* for the same reasons. – Raymond Hettinger Apr 05 '21 at 23:40
-
1Interesting; yes it's totally possible, and reasonably likely if mul fusion . The main reason *not* to do it would be to avoid having one "uop" producing 2 register outputs; something AArch64 generally avoids except for `ldp`. (Or instruction or whatever M-1 calls an internal pipeline slot). That's why AArch64 splits up `umull` into `umull` and `umulh`, or `vuzp` into `uzp1` / `uzp2`. But yes, high-performance implementations can fuse, if the decoders can find all the pairs they want to look for. – Peter Cordes Apr 06 '21 at 02:06
-
https://godbolt.org/z/oGh1jxxeP compiles (with clang and GCC) to a loop with `udiv`/`msub` *latency* as the bottleneck, vs. just `udiv` latency with `/=` instead of `%=`. That should be a better test for fusion; if it doesn't fuse the loop will actually be slower. – Peter Cordes Apr 06 '21 at 02:08
-
1Another test would be to try an unrelated msub, which we know can't fuse, to see if that's any slower. – BeeOnRope Apr 06 '21 at 05:03
-
4Per Dougall J, these are [not fused](https://dougallj.github.io/applecpu/firestorm.html): _Other tested patterns are not fused, including adrp + add, mov + movk, mul + umulh, and **udiv + msub.**_ (emphasis mine) – BeeOnRope Apr 11 '21 at 05:22