Most optimized way to calculate modulus in C

Question

I have minimize cost of calculating modulus in C. say I have a number x and n is the number which will divide x

when n == 65536 (which happens to be 2^16):

mod = x % n (11 assembly instructions as produced by GCC) or
mod = x & 0xffff which is equal to mod = x & 65535 (4 assembly instructions)

so, GCC doesn't optimize it to this extent.

In my case n is not x^(int) but is largest prime less than 2^16 which is 65521

as I showed for n == 2^16, bit-wise operations can optimize the computation. What bit-wise operations can I preform when n == 65521 to calculate modulus.

I can't imagine that any C compiler implements integer `%` as anything other than an `IDIV` (see Krystian's answer), a single instruction, apart from setup. I have little doubt that there is no algorithm that can obtain the result faster on the same CPU. — Carl Smotricz, Apr 18 '10 at 09:22
Did you turn -O2 on? Did you add `const` keyword to `n` declaration? — P Shved, Apr 18 '10 at 09:30
@Carl, Just because it is a single instruction doesn't mean that it will be the fastest you can get. Case in point, back in 2000 it was possible to calculate sqrt by doing several operations in plain c code (no assembly required) and beat the processor in both speed and accuracy (I think newer processors use the same trick internally now). A single instruction won't necessarily take a single cycle to compute! — Grant Peters, Apr 18 '10 at 10:48
@Pavel: I had a very similar question a while back, why something wasn't being optimised. In my case `n` needed to be const in GCC 3.something, but not 4.3. — Steve Jessop, Apr 18 '10 at 11:07
@Carl: It's actually common for compilers to optimize division or remainder operations by a compile-time constant. Typically a remainder operation is turned into a sequence involving a couple of multiplications, some shifts, and possibly an addition or two. Google for 'Division by Invariant Integers using Multiplication' by Granlund and Montgomery for some of the original work in this area. — Mark Dickinson, Apr 18 '10 at 11:10
Also note that a modulo operation with a compile time constant as modulus is easier to implement if the arguments are unsigned. Doing this with signed arguments requires some extra instructions. — Accipitridae, Apr 18 '10 at 11:37
How did you run gcc? When I wrote a function to return the modulus of a number by 65536 gcc came up with a single instruction: `movzwl %di, %eax` which I thought was pretty neat. — CB Bailey, Apr 18 '10 at 12:10
@Mark: You're right, of course. I failed to read from the description that the divisor would be known ahead of time, i.e. constant. — Carl Smotricz, Apr 19 '10 at 06:45
Actual compiler output (optimized and un-optimized): https://godbolt.org/g/yjvoru. It costs a lot of extra instructions to do signed `x % 65536` efficiently with C semantics, because it's not the same thing as `x & 65535`. Still, it's not 11 instructions. It's 6 in optimized code, and if your bad counting method is adding 3 instructions on top of `and r32, 0xffff` or `movzx eax, di`, that only gets you up to "9" for optimized signed `%` so you're doing something weird. (Assuming you were targeting x86). Maybe you used a compiler even older than what godbolt has (gcc 4.4.7 is the oldest) — Peter Cordes, Sep 19 '17 at 11:51

score 27 · Answer 1 · edited Oct 13 '20 at 07:48

First, make sure you're looking at optimized code before drawing conclusion about what GCC is producing (and make sure this particular expression really needs to be optimized). Finally - don't count instructions to draw your conclusions; it may be that an 11 instruction sequence might be expected to perform better than a shorter sequence that includes a div instruction.

Also, you can't conclude that because x mod 65536 can be calculated with a simple bit mask that any mod operation can be implemented that way. Consider how easy dividing by 10 in decimal is as opposed to dividing by an arbitrary number.

With all that out of the way, you may be able to use some of the 'magic number' techniques from Henry Warren's Hacker's Delight book:

There was an added chapter on the website that contained "two methods of computing the remainder of division without computing the quotient!", which you may find of some use. The 1st technique applies only to a limited set of divisors, so it won't work for your particular instance. I haven't actually read the online chapter, so I don't know exactly how applicable the other technique might be for you.

+1. Hacker's Delight and the attendant website are excellent resources. Note that it's sometimes possible to do better than the compiler if you have some extra information that the compiler lacks: for example, if you know (from context or from algorithm analysis) an upper bound on the possible size of the dividend. — Mark Dickinson, Apr 18 '10 at 11:19
That last chapter is pretty much limited to values expressed by (2^k +/- 1). So it only applies to a very narrow subset of problems. Optimized, but definitely not general. So, I can't see many uses in the broad application of what we define as reality. — ingyhere, Feb 19 '14 at 17:22

caf · Answer 2 · 2010-04-18T12:35:46.837

12

x mod 65536 is only equivalent to x & 0xffff if x is unsigned - for signed x, it gives the wrong result for negative numbers. For unsigned x, gcc does indeed optimise x % 65536 to a bitwise and with 65535 (even on -O0, in my tests).

Because 65521 is not a power of 2, x mod 65521 can't be calculated so simply. gcc 4.3.2 on -O3 calculates it using x - (x / 65521) * 65521; the integer division by a constant is done using integer multiplication by a related constant.

edited Apr 18 '10 at 12:35

answered Apr 18 '10 at 09:33

caf

233,326
40
323
462

2

-1. Using the processor's integer divide function is probably not optimal. On my machine (Intel Core 2 Duo, running in 64-bit mode), a simple C test program with gcc 4.4 and -O3 turns x % 65521 into two multiplications, two shifts and two subtractions. The way to find out for sure would be to do some timings, of course. :) – Mark Dickinson Apr 18 '10 at 11:32
1

Actual gcc output: https://godbolt.org/g/yjvoru. Note that `gcc -O0` doesn't optimize `int n=65536; return x % n;`, because `-O0` makes code that still does what the C source says after you modify variables while single-stepping with a debugger, or even jump to a different line. i.e. it spills/reloads everything between C statements, and makes no assumptions except within a statement. I only point this out because the OP said they had a variable `n` which happened to be 65536. (In which case gcc -O0 just uses `idiv`.) Anyway, obviously counting instructions != performance analysis! – Peter Cordes Sep 19 '17 at 12:07

Accipitridae · Answer 3 · 2010-04-18T12:35:33.403

rIf you don't have to fully reduce your integers modulo 65521, then you can use the fact that 65521 is close to 2**16. I.e. if x is an unsigned int you want to reduce then you can do the following:

unsigned int low = x &0xffff;
unsigned int hi = (x >> 16);
x = low + 15 * hi;

This uses that 2**16 % 65521 == 15. Note that this is not a full reduction. I.e. starting with a 32-bit input, you only are guaranteed that the result is at most 20 bits and that it is of course congruent to the input modulo 65521.

This trick can be used in applications where there are many operations that have to be reduced modulo the same constant, and where intermediary results do not have to be the smallest element in its residue class.

E.g. one application is the implementation of Adler-32, which uses the modulus 65521. This hash function does a lot of operations modulo 65521. To implement it efficiently one would only do modular reductions after a carefully computed number of additions. A reduction shown as above is enough and only the computation of the hash will need a full modulo operation.

score 3 · Answer 4 · answered Apr 18 '10 at 09:21

3

The bitwise operation only works well if the divisor is of the form 2^n. In the general case, there is no such bit-wise operation.

answered Apr 18 '10 at 09:21

Danvil

22,240
19
65
88

score 1 · Answer 5 · answered Aug 30 '14 at 14:14

If the constant with which you want to take the modulo is known at compile time and you have a decent compiler (e.g. gcc), tis usually best to let the compiler work its magic. Just declare the modulo const.

If you don't know the constant at compile time, but you are going to take - say - a billion modulos with the same number, then use this http://libdivide.com/

gabi tomuta · Answer 6 · 2017-09-21T11:08:11.327

1

As an approach when we deal with powers of 2, can be considered this one (mostly C flavored):

.
.

#define THE_DIVISOR    0x8U;  /* The modulo value (POWER OF 2). */
.
.
uint8 CheckIfModulo(const sint32 TheDividend)
{
    uint8 RetVal = 1; /* TheDividend is not modulus THE_DIVISOR. */

    if (0 == (TheDividend & (THE_DIVISOR - 1)))
    {
        /* code if modulo is satisfied */
        RetVal = 0; /* TheDividend IS modulus THE_DIVISOR. */
    }
    else
    {
        /* code if modulo is NOT satisfied */
    }
    return RetVal;
}

edited Sep 21 '17 at 11:08

answered Sep 19 '17 at 10:10

gabi tomuta

11
3

Is this supposed to be useful for cases where `TheDivisor` is known by the programmer to be a power of 2, but the compiler *doesn't* know that? If the divisor is a compile-time constant, this has no advantage. – Peter Cordes Sep 19 '17 at 11:59
Why do you treat `0 % 8` as false but `8 % 8` as true? 0 is a multiple of 8. There's nothing about this method that makes it only work dividend>divisor. You used signed integers, but this does work with negative dividends. (e.g. `-16` is `0xfffffff0` in 32-bit 2's complement representation, so checking the low 4 bits still works to check if it's a multiple of 16.) – Peter Cordes Sep 19 '17 at 12:03
Peter, thank you for your observations. I come from drivers development, hence I did not considered negative numbers. Still your observations about the 0 is totally applying here and I corrected my example. – gabi tomuta Sep 21 '17 at 11:08
If your numbers are known to be non-negative, an unsigned type is often a good idea. Unless the compiler can prove that negative numbers are impossible, it has to generate correct code, which is often slower for things like `x / 4` with signed integer (because of C signed division / modulo semantics). https://godbolt.org/g/crTsWi. Assuming performance matters when writing drivers, use the right types for the job so the compiler can generate efficient code. – Peter Cordes Sep 21 '17 at 12:53

score 1 · Answer 7 · answered Feb 26 '18 at 13:35

1

If x is an increasing index, and the increment i is known to be less than n (e.g. when iterating over a circular array of length n), avoid the modulus completely. A loop going

x += i; if (x >= n) x -= n;

is way faster than

x = (x + i) % n;

which you unfortunately find in many text books...

If you really need an expression (e.g. because you are using it in a for statement), you can use the ugly but efficient

x = x + (x+i < n ? i : i-n)

answered Feb 26 '18 at 13:35

David

166
6

1

This is true when `n` isn't a compile-time constant power of 2. If `n` is a power of 2, it is faster to unconditionally `and` with `n-1` to do an unsigned modulo. (Compilers will do that for you if `n` is a compile-time constant and the types are unsigned.) – Peter Cordes Feb 26 '18 at 16:58

score 0 · Answer 8 · answered Apr 18 '10 at 09:16

0

idiv — Integer Division

The idiv instruction divides the contents of the 64 bit integer EDX:EAX (constructed by viewing EDX as the most significant four bytes and EAX as the least significant four bytes) by the specified operand value. The quotient result of the division is stored into EAX, while the remainder is placed in EDX.

source: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html

answered Apr 18 '10 at 09:16

Kris

1,388
6
12

5

`idiv` or a similar opcode (depending on the CPU) might be the best option if the denominator is a variable whose value isn't known until runtime, but if the denominator is a known constant there are optimizations that can be performed that may run faster than `idiv`. However, today's compilers are aware of those optimizations (as well as how to use the `div` opcode to get a remainder), so there's generally nothing special that needs to be done in a C program to take advantage of them. – Michael Burr Apr 18 '10 at 10:14

Most optimized way to calculate modulus in C

8 Answers8

Linked