Obtaining remainder using single aarch64 instruction?

Question

I am writing some assembly code for ARM8 (aarch64). I want to do a division and use the remainder obtained for further calculations. In x86 when I use 'div', and I know my remainder is kept in RDX. My question is - is there an equivalent to that in the aarch64 instruction set? I know 'udiv' and 'sdiv' do unsigned and signed divisions and gets me the quotient. Is there a single instruction which would give me remainder? (I want % modulo operator in c). I understand I can obtain it using algebra, just wanted to confirm I am not missing out on a simpler method.

score 13 · Accepted Answer · answered Feb 11 '16 at 23:29

13

Barring constant power-of-two divisors which can be optimised down to an and, there is no instruction that will calculate the remainder of a division. You can, however do it pretty neatly in two:

// input: x0=dividend, x1=divisor
udiv x2, x0, x1
msub x3, x2, x1, x0
// result: x2=quotient, x3=remainder

answered Feb 11 '16 at 23:29

Notlikethat

20,095
3
40
77

This is wonderful! I used these 2 instructions to help in adding the digits of a given number for a programming assignment using a required loop. Thanks! – merlot Apr 06 '22 at 01:40

Raymond Hettinger · Answer 2 · 2021-04-06T09:05:28.500

6

Calculating the remainder isn't a single instruction

The Clang C compiler produced the following code for a modulo calculation:

udiv    x10, x0, x9
msub    x10, x10, x9, x0

Good news, this isn't slow!

While the x86 does this in a single instruction, that doesn't make it faster.

On the Apple M-1, the above instruction pair is executed in about the same time as a single step. Possibly this is due to instruction macro-fusion that decodes multiple instructions into a single µ-op. It could also be due to parallelism in multiple execution units. Possibly, it is done in one EU where the remainder from the division calculation is cached and returned immediately.

Whatever the implementation, it seems to be about as fast as Intel's single instruction form.

Division-only

Timing:

$ time ./a.out 12345678901
Total: 301123495054
real    0m10.036s
user    0m9.668s
sys 0m0.031s

Instruction generated:

udiv    x10, x0, x9

Remainder-only

Timing:

$ time ./a.out 12345678901
Total: 8612082846779832640
real    0m10.190s
user    0m9.768s
sys 0m0.070s

Instructions generated:

udiv    x10, x0, x9
msub    x10, x10, x9, x0

Division and remainder

Timing:

$ time ./a.out 12345678901
Total: 8612083123211969892
real    0m10.103s
user    0m9.752s
sys 0m0.019s

Instructions generated:

udiv    x10, x0, x9
msub    x11, x10, x9, x0

Benchmark code

The following C code can be run with either q = n / d or r = n % d commented out:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    unsigned long long n, d, q=1, r=1, total=0;

    n = strtoull(argv[1], NULL, 10);
    total = 0;
    for (d=1 ; d<=n ; d++) {
        q = n / d;
        r = n % d;
        total += q + r;
    }
    printf("Total: %llu", total);
    return 0;
}

edited Apr 06 '21 at 09:05

answered Apr 05 '21 at 22:36

Raymond Hettinger

216,523
63
388
485

3

You haven't proved it's a "single step" (fused into 1 wide instruction like mov/movk); I think you've proved that the throughput bottleneck of `udiv` can hide the extra work of an `msub` just fine, which is expected. I'd guess that a dependency chain through `%` would have higher latency than a dep chain through `/`, i.e. if the next iteration's inputs depended on a `%` this iteration, like `x %= d` if that can be arranged to not optimize away, e.g. by decrementing `d` instead of incrementing so it can't prove that it's a no-op. – Peter Cordes Apr 05 '21 at 23:04
@PeterCordes Given that multiple execution units are involved, the timings don't constitute proof. That said, it is likely given that 1) macro-fusion is a well known and proven technique, 2) essentially all reasonable hardware implementations of integer division will simultaneously produce a remainder, 3) it would be wasteful to not combine the steps, and 4) Daniel Lemire also showed that the M-1 fuses *mul* and *umulh* for the same reasons. – Raymond Hettinger Apr 05 '21 at 23:40
1

Interesting; yes it's totally possible, and reasonably likely if mul fusion . The main reason *not* to do it would be to avoid having one "uop" producing 2 register outputs; something AArch64 generally avoids except for `ldp`. (Or instruction or whatever M-1 calls an internal pipeline slot). That's why AArch64 splits up `umull` into `umull` and `umulh`, or `vuzp` into `uzp1` / `uzp2`. But yes, high-performance implementations can fuse, if the decoders can find all the pairs they want to look for. – Peter Cordes Apr 06 '21 at 02:06
https://godbolt.org/z/oGh1jxxeP compiles (with clang and GCC) to a loop with `udiv`/`msub` *latency* as the bottleneck, vs. just `udiv` latency with `/=` instead of `%=`. That should be a better test for fusion; if it doesn't fuse the loop will actually be slower. – Peter Cordes Apr 06 '21 at 02:08
1

Another test would be to try an unrelated msub, which we know can't fuse, to see if that's any slower. – BeeOnRope Apr 06 '21 at 05:03
4

Per Dougall J, these are [not fused](https://dougallj.github.io/applecpu/firestorm.html): _Other tested patterns are not fused, including adrp + add, mov + movk, mul + umulh, and **udiv + msub.**_ (emphasis mine) – BeeOnRope Apr 11 '21 at 05:22

Obtaining remainder using single aarch64 instruction?

2 Answers2

Calculating the remainder isn't a single instruction

Good news, this isn't slow!

Division-only

Remainder-only

Division and remainder

Benchmark code

Linked

Related