36

Is there a trick for creating a faster integer modulus than the standard % operator for particular bases?

For my program, I'd be looking for around 1000-4000 (e.g. n%2048). Is there a quicker way to perform n modulus 2048 than simply: n%2048?

AstroCB
  • 12,337
  • 20
  • 57
  • 73
Dan W
  • 3,520
  • 7
  • 42
  • 69
  • 22
    Nice choice of username. – Dan W Jun 14 '12 at 20:12
  • 2048 sounds suspiciously like you're going to use this in the context of RSA maybe? If so, you are better off looking into modular exponentiation directly. – emboss Jun 16 '12 at 13:17
  • 1
    Nope, it's for a quicker sine/cosine function (see [here](http://stackoverflow.com/questions/2088194/fast-sin-cos-using-a-pre-computed-translation-array)). The 2048 represents the table size. – Dan W Jun 16 '12 at 19:48

7 Answers7

52

If the denominator is known at compile time to be a power of 2, like your example of 2048, you could subtract 1 and do a bitwise-and.

That is:

n % m == n & (m - 1) 

...where m is a power of 2.

For example:

22 % 8 == 6

         Dec       Bin
       -----     -----
22 & (8 - 1) =   10110 
               & 00111 
               -------
           6 =   00110

Bear in mind that a good compiler will have its own optimizations for %, maybe even enough to be as fast as the above technique. Arithmetic operators tend to be pretty heavily optimized.

Justin Morgan - On strike
  • 30,035
  • 12
  • 80
  • 104
  • 3
    Perhaps in C/C++, but in C#, I find a speed increase of at least 1.25x faster (and probably much higher since other stuff is going on as well - this is real world code). – Dan W Jun 14 '12 at 23:12
  • 1
    @DanW - Great to hear. I'm a C# man myself, so I was wondering how the performance would look in .NET. – Justin Morgan - On strike Jun 15 '12 at 00:20
  • 1
    @DanW: Keep in mind, this also works for *negative* values of *n*, whereas *%* does not. For negative *n* and positive *m*, *%* gives a negative remainder. – Mike Dunlavey Aug 17 '14 at 12:54
  • I just ran a quick test with the current C# compiler (in VS 2013), and I found a big difference in Debug mode, but no difference in Release mode. – Neil Nov 13 '14 at 06:56
  • 1
    Beware how the modulo (or divisor) is defined: C#/RuyJIT/VS2015 e.g. optimized my modulo and division to be 10 _times_ faster with modulo/divisor specified in the same method or as const vs. passed in as parameters, so pre-define them if possible. – Kristian Wedberg Mar 16 '16 at 01:07
  • 1
    And if you know in advance what `m` will exactly be, not just that it's a power of `2`, you can pre-subtract. To get a `mod` of `16`, use `n & 15`. – Aaron Franke Mar 21 '18 at 04:05
  • @AaronFranke - Good call. I had that in mind, but I should've been explicit. If you know `m` in advance, I'm guessing `&` would pretty much always be faster than `%`. But if you can't pre-subtract, you might find `%` is faster than `-` and `&` together. – Justin Morgan - On strike Mar 13 '19 at 16:00
16

For powers of two 2^n, all you have to do is zero out all bits except the last n bits.

For example (assuming 32 bit integers):

x%2 is equivalent to x & 0x00000001

x%4 is equivalent to x & 0x00000003

In general x % (2^n) is equal to x & (2^n-1). Written out in C, this would be x & ((1<<n)-1).

This is because 2^n gives you a 1 in the n+1th bit (from the right). So 2^n-1 will give you n ones on the right, and zeros on the left.

tskuzzy
  • 35,812
  • 14
  • 73
  • 140
3

Branchless non-power-of-two modulus is possible by precomputing magic constants at run-time, to implement division using a multiply-add-shift.

This is roughly 2x faster than the built-in modulo operator % on my Intel Core i5.

I'm surprised it's not more dramatic, as x86 CPU div instructions can have latencies as high as 80-90 cycles for 64-bit division on some CPUs, compared to mul at 3 cycles and bitwise ops at 1 cycle each.

Proof of concept and timings shown below. series_len refers to the number of modulus ops performed in series on a single var. That's to prevent the CPU from hiding latencies through parallelization.


#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

typedef int32_t s32;
typedef uint32_t u32;
typedef uint64_t u64;

#define NUM_NUMS 1024
#define NUM_RUNS 500
#define MAX_NUM UINT32_MAX
#define MAX_DEN 1024

struct fastdiv {
    u32 mul;
    u32 add;
    s32 shift;
    u32 _odiv;  /* save original divisor for modulo calc */
};

static u32 num[NUM_NUMS];
static u32 den[NUM_NUMS];
static struct fastdiv fd[NUM_NUMS];

/* hash of results to prevent gcc from optimizing out our ops */
static u32 cookie = 0;

/* required for magic constant generation */
u32 ulog2(u32 v) {
    u32 r, shift;
    r =     (v > 0xFFFF) << 4; v >>= r;
    shift = (v > 0xFF  ) << 3; v >>= shift; r |= shift;
    shift = (v > 0xF   ) << 2; v >>= shift; r |= shift;
    shift = (v > 0x3   ) << 1; v >>= shift; r |= shift;
                                            r |= (v >> 1);
    return r;
}

/* generate constants for implementing a division with multiply-add-shift */
void fastdiv_make(struct fastdiv *d, u32 divisor) {
    u32 l, r, e;
    u64 m;

    d->_odiv = divisor;
    l = ulog2(divisor);
    if (divisor & (divisor - 1)) {
        m = 1ULL << (l + 32);
        d->mul = (u32)(m / divisor);
        r = (u32)m - d->mul * divisor;
        e = divisor - r;
        if (e < (1UL << l)) {
            ++d->mul;
            d->add = 0;
        } else {
            d->add = d->mul;
        }
        d->shift = l;
    } else {
        if (divisor == 1) {
            d->mul = 0xffffffff;
            d->add = 0xffffffff;
            d->shift = 0;
        } else {
            d->mul = 0x80000000;
            d->add = 0;
            d->shift = l-1;
        }
    }
}

/* 0: use function that checks for a power-of-2 modulus (speedup for POTs)
 * 1: use inline macro */
#define FASTMOD_BRANCHLESS 0

#define fastdiv(v,d) ((u32)(((u64)(v)*(d)->mul + (d)->add) >> 32) >> (d)->shift)
#define _fastmod(v,d) ((v) - fastdiv((v),(d)) * (d)->_odiv)

#if FASTMOD_BRANCHLESS
#define fastmod(v,d) _fastmod((v),(d))
#else
u32 fastmod(u32 v, struct fastdiv *d) {
    if (d->mul == 0x80000000) {
        return (v & ((1 << d->shift) - 1));
    }
    return _fastmod(v,d);
}
#endif

u32 random32(u32 upper_bound) {
    return arc4random_uniform(upper_bound);
}

u32 random32_range(u32 lower_bound, u32 upper_bound) {
    return random32(upper_bound - lower_bound) + lower_bound;
}

void fill_arrays() {
    int i;
    for (i = 0; i < NUM_NUMS; ++i) {
        num[i] = random32_range(MAX_DEN, MAX_NUM);
        den[i] = random32_range(1, MAX_DEN);
        fastdiv_make(&fd[i], den[i]);
    }
}

void fill_arrays_pot() {
    u32 log_bound, rand_log;
    int i;

    log_bound = ulog2(MAX_DEN);
    for (i = 0; i < NUM_NUMS; ++i) {
        num[i] = random32_range(MAX_DEN, MAX_NUM);
        rand_log = random32(log_bound) + 1;
        den[i] = 1 << rand_log;
        fastdiv_make(&fd[i], den[i]);
    }
}

u64 clock_ns() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec*1000000000 + tv.tv_usec*1000;
}

void use_value(u32 v) {
    cookie += v;
}

int main(int argc, char **arg) {
    u64 builtin_npot_ns;
    u64 builtin_pot_ns;
    u64 branching_npot_ns;
    u64 branching_pot_ns;
    u64 branchless_npot_ns;
    u64 branchless_pot_ns;
    u64 t0, t1;
    u32 v;
    int s, r, i, j;
    int series_len;

    builtin_npot_ns = builtin_pot_ns = 0;
    branching_npot_ns = branching_pot_ns = 0;
    branchless_npot_ns = branchless_pot_ns = 0;

    for (s = 5; s >= 0; --s) {
        series_len = 1 << s;
        for (r = 0; r < NUM_RUNS; ++r) {
            /* built-in NPOT */
            fill_arrays();
            t0 = clock_ns();
            for (i = 0; i < NUM_NUMS; ++i) {
                v = num[i];
                for (j = 0; j < series_len; ++j) {
                    v /= den[i];
                }
                use_value(v);
            }
            t1 = clock_ns();
            builtin_npot_ns += (t1 - t0) / NUM_NUMS;

            /* built-in POT */
            fill_arrays_pot();
            t0 = clock_ns();
            for (i = 0; i < NUM_NUMS; ++i) {
                v = num[i];
                for (j = 0; j < series_len; ++j) {
                    v /= den[i];
                }
                use_value(v);
            }
            t1 = clock_ns();
            builtin_pot_ns += (t1 - t0) / NUM_NUMS;

            /* branching NPOT */
            fill_arrays();
            t0 = clock_ns();
            for (i = 0; i < NUM_NUMS; ++i) {
                v = num[i];
                for (j = 0; j < series_len; ++j) {
                    v = fastmod(v, fd+i);
                }
                use_value(v);
            }
            t1 = clock_ns();
            branching_npot_ns += (t1 - t0) / NUM_NUMS;

            /* branching POT */
            fill_arrays_pot();
            t0 = clock_ns();
            for (i = 0; i < NUM_NUMS; ++i) {
                v = num[i];
                for (j = 0; j < series_len; ++j) {
                    v = fastmod(v, fd+i);
                }
                use_value(v);
            }
            t1 = clock_ns();
            branching_pot_ns += (t1 - t0) / NUM_NUMS;

            /* branchless NPOT */
            fill_arrays();
            t0 = clock_ns();
            for (i = 0; i < NUM_NUMS; ++i) {
                v = num[i];
                for (j = 0; j < series_len; ++j) {
                    v = _fastmod(v, fd+i);
                }
                use_value(v);
            }
            t1 = clock_ns();
            branchless_npot_ns += (t1 - t0) / NUM_NUMS;

            /* branchless POT */
            fill_arrays_pot();
            t0 = clock_ns();
            for (i = 0; i < NUM_NUMS; ++i) {
                v = num[i];
                for (j = 0; j < series_len; ++j) {
                    v = _fastmod(v, fd+i);
                }
                use_value(v);
            }
            t1 = clock_ns();
            branchless_pot_ns += (t1 - t0) / NUM_NUMS;
        }

        builtin_npot_ns /= NUM_RUNS;
        builtin_pot_ns /= NUM_RUNS;
        branching_npot_ns /= NUM_RUNS;
        branching_pot_ns /= NUM_RUNS;
        branchless_npot_ns /= NUM_RUNS;
        branchless_pot_ns /= NUM_RUNS;

        printf("series_len = %d\n", series_len);
        printf("----------------------------\n");
        printf("builtin_npot_ns    : %llu ns\n", builtin_npot_ns);
        printf("builtin_pot_ns     : %llu ns\n", builtin_pot_ns);
        printf("branching_npot_ns  : %llu ns\n", branching_npot_ns);
        printf("branching_pot_ns   : %llu ns\n", branching_pot_ns);
        printf("branchless_npot_ns : %llu ns\n", branchless_npot_ns);
        printf("branchless_pot_ns  : %llu ns\n\n", branchless_pot_ns);
    }
    printf("cookie=%u\n", cookie);
}

Results

Intel Core i5 (MacBookAir7,2), macOS 10.11.6, clang 8.0.0

series_len = 32
----------------------------
builtin_npot_ns    : 218 ns
builtin_pot_ns     : 225 ns
branching_npot_ns  : 115 ns
branching_pot_ns   : 42 ns
branchless_npot_ns : 110 ns
branchless_pot_ns  : 110 ns

series_len = 16
----------------------------
builtin_npot_ns    : 87 ns
builtin_pot_ns     : 89 ns
branching_npot_ns  : 47 ns
branching_pot_ns   : 19 ns
branchless_npot_ns : 45 ns
branchless_pot_ns  : 45 ns

series_len = 8
----------------------------
builtin_npot_ns    : 32 ns
builtin_pot_ns     : 34 ns
branching_npot_ns  : 18 ns
branching_pot_ns   : 10 ns
branchless_npot_ns : 17 ns
branchless_pot_ns  : 17 ns

series_len = 4
----------------------------
builtin_npot_ns    : 15 ns
builtin_pot_ns     : 16 ns
branching_npot_ns  : 8 ns
branching_pot_ns   : 3 ns
branchless_npot_ns : 7 ns
branchless_pot_ns  : 7 ns

series_len = 2
----------------------------
builtin_npot_ns    : 8 ns
builtin_pot_ns     : 7 ns
branching_npot_ns  : 4 ns
branching_pot_ns   : 2 ns
branchless_npot_ns : 2 ns
branchless_pot_ns  : 2 ns
2

You could zero out the high order bits i.e.

x = 11 = 1011
x % 4 = 3 = 0011

so for x % 4 you could just take the last 2 bits - I'm not sure what would happen if negative numbers were used though

2

Here's a few techniques that replicate the modulus operation.

Of those benchmarked, this was the fastest (modified to fit your 2048 scenario). As long as your "max" isn't millions and in the 1000-4000 range you mentioned, it may work faster for you too:

int threshold = 2048; //the number to mod by
int max = 1000; //the number on the left. Ex: 1000 % 2048
int total = 0;
int y = 0;
for (int x = 0; x < max; x++)
{
    if (y > (threshold - 1))
    {
        y = 0;
        total += x;
    }
    y += 1;
}
return total;

Give it a go. It performed faster on the author's machine at various settings, so should perform admirably well for you too.

Ice-Blaze
  • 897
  • 8
  • 18
0

If you are dividing by literals that are powers of two, then the answer is probably No: Any decent compiler will automatically turn such expressions into a variation of an AND operation, which is pretty close to optimal.

  • 1
    In C#, I do get a speed increase, see above. – Dan W Jun 14 '12 at 23:14
  • Out of curiosity, what is the data type and are you measuring a stand-alone release build? – 500 - Internal Server Error Jun 14 '12 at 23:17
  • Integer. Release build without debugging. It was used to further increase the speed of calculating a quick sine function from Charles' answer [here](http://stackoverflow.com/questions/2088194/fast-sin-cos-using-a-pre-computed-translation-array). See the `index %= TABLE_SIZE;` bit. – Dan W Jun 14 '12 at 23:25
-1

The fastest way to multiply/divide unsigned integers numbers is by bit shifting them left or right. Shift operations match directly to CPU commands. For example, 3 << 2 =6, while 4>>1 = 2.

You can use the same trick to calculate the module: Shift an integer far enough to the left so that only the remainder bits are left, then shift it back right so you can check the remainder value.

On the other hand, integer modulo also exists as a CPU command. If the integer modulo operator maps to this command in optimized builds, you will not see any improvement by using the bit shift trick.

The following code caclulates 7%4 by shifting far enough that only the 2 last bits are left (since 4=2^2). This means that we need to shift 30 bits:

uint i=7;
var modulo=((i<<30)>>30);

The result is 3

EDIT:

I just read all the solutions proposing simply erasing the higher order bits. It has the same effect, but a lot simpler and direct.

Panagiotis Kanavos
  • 120,703
  • 13
  • 188
  • 236
  • It's _not at all_ true that CPU `div`/`mod` instructions are as fast as bitwise instructions. They are much, much slower. Around [80-90 cycles](https://gmplib.org/~tege/x86-timing.pdf) for a 64-bit `div` compared to 3 cycles for `mul` and 1 cycle for `and`. –  Dec 07 '17 at 02:56
  • @jrodatus and this is related to this answer how? Never mind that talking about cycles without mentioning the CPU is meaningless. It used to be that mul took 90 or more instructions. And pointing to someone's homework on old CPUs doesn't prove anything. The behavior of each CPU instruction is documented, you don't need to measure them indirectly. PS: and then there are SIMD commands, AVX and SSE4 that make *flotating* point as fast or faster if you have enough data – Panagiotis Kanavos Dec 07 '17 at 09:56
  • @jrodatus you may want to read the other questions and comments as well - this is a C# question, not an assembly question. The generated IL and assembly can differe from one processor to the next and one CPU to the next. As Kristian Wedberg mentioned in a comment, you can get 10x better performance with mod now. If the compiler detects that the divisor is a power of 2 it may well use one of the many bitwise techniques here – Panagiotis Kanavos Dec 07 '17 at 10:03
  • 1 (Relevance). You incorrectly stated _"If the integer modulo operator maps to this [**integer modulo** CPU] command in optimized builds, you will not see any improvement by using the bit shift trick."_ Contrary to that statement, bit shifting _is_ faster than div/mod CPU commands on modern CPUs. Note the benchmark code and timing results in my answer. (cont'd) –  Dec 07 '17 at 11:51
  • 2 (C vs. C#). The question title says **C** / C#. C maps to assembly, so CPU instruction timings **are** relevant (as you yourself stated). (cont'd) –  Dec 07 '17 at 11:51
  • 3 ('homework'). That is a report, updated 2017, by the developers of the [GNU Multiple Precision Arithmetic Library](https://gmplib.org/), distributed with OS's and used in serious cryptographic, scientific and industrial applications. That is not "someone's homework on old CPUs." **AMD Zen** (covered in the report) was launched in [February 2017](https://en.wikipedia.org/wiki/Zen_(microarchitecture)). –  Dec 07 '17 at 11:52
  • 4. You're right / good point that I should've mentioned CPU builds when referencing timing results (duh). Updated my answer. I'm not trying to be argumentative, just wanted to point out that instruction timings DO make a difference in modulo performance, contrary to what you wrote (saying "you will not see any improvement"). Again, see my benchmark code and performance results (Intel Core i5). –  Dec 07 '17 at 11:59
  • @jrodatus Posting an unrelated comment 5 years later? OK. You win – Panagiotis Kanavos Dec 07 '17 at 11:59
  • @Kanavos, I just realized why you might've found equal performance for both methods. When you measured your timings, if you didn't use the result value, some compilers will optimize out the operation entirely. Hashing the results and printing the hash generally fixes this, and should show the difference on most [modern] CPUs –  Dec 07 '17 at 13:13