Optimising the 64-bit modulus operation, for a repeatedly used runtime modulo

Question

I have billions of unsigned 64-bit numbers (x) for which I need to find x % m. m is the same for all, but is not a compile-time constant. m is greater than one.

Both x and m are equally likely to hold any 64-bit value.

(If m were a compile-time constant, I could just trust the compiler to do its best.)

As m is the same for all of these operations, and I have storage available, is there an optimisation which will let me calculate all the results for x % m faster than if m were different for each x?

(I'm inspired by the division-by-a-constant optimisation which transforms division into multiplication, shifts and addition. Illustrative code in C, Rust or pseudo-code would be great.)

I have looked at the document, but their optimisation of calculating modulus without calculating the quotient only applies to divisors of 2^n +/- 1.

@AlbertoSinigaglia Yes, but `m` is constant for billions of values of `x`. — fadedbee, Dec 16 '22 at 12:15
@chux-ReinstateMonica `m` is a uint64_t which is greater than one, as zero would be undefined behaviour and `x % 1` is trivially zero. — fadedbee, Dec 16 '22 at 12:20
Did you measure that the modulo operation is the cause of the (suspected) performace problem? How long does it currently take? — Bodo, Dec 16 '22 at 12:22
fadedbee, "is there an optimization" --> when `m` is a power-of-2 (easy to test), then only an `and` is needed. — chux - Reinstate Monica, Dec 16 '22 at 12:23
@Bodo, No but modulus, and load and store are the only operations inside a tight loop. I will test. — fadedbee, Dec 16 '22 at 12:26
@chux-ReinstateMonica There are only 64 powers of two (in `uint64_t`) so it is unlikely that `m` will ever be a power of two. — fadedbee, Dec 16 '22 at 12:27
The distribution of `m` was not specified. In fact the overall context was not either. Providing the larger picture could be useful to improve performance in ways other than this narrow approach. Good luck. — chux - Reinstate Monica, Dec 16 '22 at 12:30
There is actually a lot that you can do, for any `m`, similar to what happens at compile time but you'd do it at runtime. BTW, checking just in case, how would you use this? Or are you interested "in general"? There are various cases with special solutions, such as using Montgomery modular multiplication (if applicable) or using multiplication by modular multiplicative inverse to check for divisibility. — harold, Dec 16 '22 at 12:51
@fadedbee Barrett reduction is quite generally applicable. However, it mostly helps in the case that `x` is significantly larger than `m`. If `x` is usually close to `m`, most of the Barrett reduction does nothing, and it's all down to the conditional subtraction. — harold, Dec 16 '22 at 13:02
Optimization problems should usually state the hardware being used, and other relevant environmental features, and the context of the problem. (E.g., if these `x` are coming as the result multiplications modulo `m`, then [Montgomery multiplication](https://en.wikipedia.org/wiki/Montgomery_modular_multiplication) may be relevant. If they are coming via additions, then simple compare-and-subtract may be useful.) Update the problem with additional information. — Eric Postpischil, Dec 16 '22 at 13:33
The statement “Both x and m are equally likely to hold any 64-bit value” is dubious. The purposes for which humans perform modulo arithmetic are not evenly distributed over all 64-bit moduli, nor do the processes they use generate uniformly distributed subject values. Zero is a 64-bit value. Is it equally likely to be a modulus? What about one? — Eric Postpischil, Dec 16 '22 at 13:35
@fadedbee "is there an optimisation which will let me calculate all the results for x % m faster" --> is the goal to be faster on average ( I think this is what you want), or in worse case or what? — chux - Reinstate Monica, Dec 16 '22 at 17:46
[libdivide](https://github.com/ridiculousfish/libdivide) was written exactly for this purpose in C and C++ — phuclv, Dec 17 '22 at 12:57
Does this answer your question? [Repeated integer division by a runtime constant value](https://stackoverflow.com/questions/45353629/repeated-integer-division-by-a-runtime-constant-value) — phuclv, Jan 12 '23 at 13:09

fadedbee · Answer 1 · 2022-12-17T10:22:44.193

Below is an approach that I will test, as the comments on the question have made it plain that the distribution of x and m are important.

What I hadn't realised is that it is 50% likely that m > x.

Untested pseudo-code:

fn mod(x: u64, m: u64) -> u64 {
    if m > x {
        x
    } else if 2*m > x {
        x - m
    } else {
        x % m
    }
}

This may well be what the compiler produces. I had not tested.

Update: I've tested this optimisation on my i5 2500K, processing 1 GiB of data.

The result is a stable 67% speed increase:

opt: 713.113926ms
plain: 1.195687298s

The benchmarking code is:

use std::time::Instant;
use rand::Rng;

const ARR_LEN: usize = 128 * 1024;
const ROUNDS: usize = 1024;

fn plain_mod_test(x: &[u64; ARR_LEN], m: u64, result: &mut [u64; ARR_LEN]) {
    for i in 0..ARR_LEN {
        result[i] = x[i] % m;
    }
}

fn opt_mod_test(x: &[u64; ARR_LEN], m: u64, result: &mut [u64; ARR_LEN]) {
    for i in 0..ARR_LEN {
        result[i] = if m > x[i] {
            x[i]
        } else if m > x[i] / 2 {
            x[i] - m
        } else {
            x[i] % m
        }
    }
}

fn main() {
    // 1 MiB of pseudo-random values x
    let mut rng = rand::thread_rng();
    let mut x = [0u64; ARR_LEN];
    for i in 0..ARR_LEN {
        x[i] = rng.gen();
    }

    // 1 KiB of pseudo-random modulii m
    let mut m = [0u64; ROUNDS];
    for r in 0..ROUNDS {
        m[r] = rng.gen(); // there's only a 1 in 2^64 chance that 0 will be generated
    }

    // 1 GiB of output each, use Vec to avoid stack overflow
    let mut plain_results = vec![[0u64; ARR_LEN]; ROUNDS];
    let mut opt_results = vec![[0u64; ARR_LEN]; ROUNDS];

    // These loops modulus 1GB of data each.
    let start_opt = Instant::now();
    for r in 0..ROUNDS {
        opt_mod_test(&x, m[r], &mut opt_results[r]);
    }
    let stop_opt = Instant::now();
    println!("opt: {:?}", stop_opt - start_opt);

    let start_plain = Instant::now();
    for r in 0..ROUNDS {
        plain_mod_test(&x, m[r], &mut plain_results[r]);
    }
    let stop_plain = Instant::now();
    println!("plain: {:?}", stop_plain - start_plain);


    // Stop the results from being optimised away, by using them.
    let mut plain_sum = 0;
    let mut opt_sum = 0;
    for r in 0..ROUNDS {
        for i in 0..ARR_LEN {
            plain_sum += plain_results[r][i];
            opt_sum += opt_results[r][i];
        }
    }

    println!("opt_sum: {:?}", opt_sum);
    println!("plain_sum: {:?}", plain_sum);

}

`2*m` risks overflow. Maybe `m > x/2` or something like that. — chux - Reinstate Monica, Dec 16 '22 at 13:21
It highly depends on the processor architecture what effect this implementation will have. You would have to compare the clock cycles needed for the instructions involved. The compiler might generate a substraction and a conditional jump for the first case, a bit shift, a subtraction, a conditional jump and another subtraction for the second case and a division for the third case. The time needed to check for one case is added to the time needed for the following cases. Alternative implementation `uint64_t mod(uint64_t x,uint64_t m){if(m>x)return x;else{x-=m;if(m>x)return x;else return x%m;}}` — Bodo, Dec 16 '22 at 14:19

Optimising the 64-bit modulus operation, for a repeatedly used runtime modulo

1 Answers1