Efficient mod 3 in x86 assembly

Question

The DIV instruction is expensive on modern processors. Is there a faster way to reduce a 64-bit integer mod 3 in x86 assembly?

If you're going to use this only to test whether the result is zero or not, there is a much simpler way to test for divisibility than actually computing the remainder. — harold, May 27 '17 at 11:02
Strongly related: [How does the GCC implementation of module (%) work, and why does it not use the div instruction?](https://stackoverflow.com/questions/4361979/how-does-the-gcc-implementation-of-module-work-and-why-does-it-not-use-the) — Cody Gray - on strike, May 27 '17 at 12:09
For the divisibility check mentioned by @harold, see [this answer](https://stackoverflow.com/a/27847160/780717) — njuffa, May 27 '17 at 13:29

njuffa · Accepted Answer · 2017-05-27T14:11:36.160

9

There are algorithms for that, based on performing division by multiplication with the reciprocal of the divisor. There are various papers on this, the one most commonly cited is:

Torbjörn Granlund and Peter L. Montgomery. "Division by invariant integers using multiplication." ACM SIGPLAN Notices. Vol. 29, No. 6, August 1994, pp. 61-72 (online)

Your C/C++ compiler very likely already uses a variant of this algorithm when optimizations are turned on. For example, my Intel compiler, version 13, turns this:

#include <stdint.h>
uint64_t mod3 (uint64_t a)
{
    return a % 3;
}

into this (line-end annotations mine):

mod3    PROC
; parameter 1: rcx
        mov       r8, 0aaaaaaaaaaaaaaabH      ;; (scaled) reciprocal of 3
        mov       rax, rcx
        mul       r8                          ;; multiply with reciprocal
        shr       rdx, 1                      ;; quotient
        lea       r9, QWORD PTR [rdx+rdx*2]   ;; back multiply with 3
        neg       r9
        add       rcx, r9                     ;; subtract from dividend 
        mov       rax, rcx                    ;; remainder
        ret
mod3    ENDP

edited May 27 '17 at 14:11

answered May 27 '17 at 02:55

njuffa

23,970
4
78
130

3

There's also: Niels Möller and Torbjörn Granlund. ["Improved division by invariant integers."](https://gmplib.org/~tege/division-paper.pdf), which is better suited to modern processors. But probably not as efficient for this special case. – Brett Hale May 27 '17 at 03:15
Indeed, all x86-64 compilers that I've seen apply this algorithm, generating almost identical code. The only difference is that Intel tends to do `neg`+`add`, instead of `sub`. I've never figured out a good reason for that; their documentation doesn't indicate that `add` is any faster than `sub`. So either this is just a quirk of the compiler's internal implementation, or they have some interesting inside information on their own processors that says different. – Cody Gray - on strike May 27 '17 at 12:12
Impressive that's actually faster. – Joshua May 27 '17 at 14:40
1

@Joshua 64bit division is pretty slow. Even on something new like Kaby calculating 0/3 with a 64bit div takes over 30 cycles and that's the *best* case, it goes up to nearly 50 cycles for higher dividends (or nearly 100 when considering all possible dividends/divisors). – harold May 27 '17 at 16:04

Efficient mod 3 in x86 assembly

1 Answers1