Accelerating performance of loop performing unsigned long long modulo operation

Question

I need to perform many operations that find the remainder of dividing an unsigned long long number by a 16-bit modulus:

unsigned long long largeNumber;
long residues[100];
unsigned long modules[100];
intiModules(modules); //set different 16-bit values

for(int i = 0; i < 100; i++){
     residues[i] = largeNumber % modules[i];
}

How can I accelerate this loop?

The iteration count is not large (32-128), but this loop is executed very often, so its speed is critical.

I don't think you can do much here. Maybe hand coding it in assembler might help a bit. But anyway 100 is not really "many". — Jabberwocky, Feb 27 '14 at 09:50
One option is to use pthreads to execute many modulus operations in parallel. — Jay K, Feb 27 '14 at 09:51
If the range of your modulus values is contiguous, you can keep just one variable and increment it in the loop. E.g., if your values are in the range (low, high), then `for (i = low; i <= high; i++) { residue[i-low] = largeNumber % i; }` — brokenfoot, Feb 27 '14 at 09:52
What is the return value of `intiModules`? If it's 16 bit then storing it in a short **may** be faster, since the longer the division, the longer the latency (according to Agner Fog's tables). Also, using SIMD and multiprocessing would improve performance — phuclv, Feb 27 '14 at 09:58
Does `unsigned long modules[100] = intiModules();` really work? — Lee Duhem, Feb 27 '14 at 10:12
@user2810512 This is confusing. Anyway, you could write something like `unsigned long modules[100]; initModules(modules);` for the same purpose. — Lee Duhem, Feb 27 '14 at 10:37

score 2 · Answer 1 · edited May 23 '17 at 12:02

If speed is critical, then according to this answer about branch prediction and this one, loop unrolling may help: it reduces the number of loop-condition tests performed by the for statement, which can also improve "branch prediction".

The gain, if any (some compilers perform this optimization for you), varies with the architecture and compiler.

On my machine, changing the loop while preserving the number of operations from

for(int i = 0; i < 500000000; i++){
    residues[i % 100] = largeNumber % modules[i % 100];
}

to

for(int i = 0; i < 500000000; i+=5){
    residues[(i+0) % 100] = largeNumber % modules[(i+0) % 100];
    residues[(i+1) % 100] = largeNumber % modules[(i+1) % 100];
    residues[(i+2) % 100] = largeNumber % modules[(i+2) % 100];
    residues[(i+3) % 100] = largeNumber % modules[(i+3) % 100];
    residues[(i+4) % 100] = largeNumber % modules[(i+4) % 100];
}

With gcc -O2 the gain is ~15%. (I used 500000000 iterations instead of 100 to get a measurable time difference.)
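For the loop in the question itself, the same idea could look roughly like the sketch below. The function name `residues_unrolled4` and the unroll factor of 4 are my own assumptions, not something from the question; whether it beats what the compiler already generates has to be measured on the target machine.

```c
#include <stddef.h>

/* Hypothetical helper (not from the question): fills res[0..count-1]
   with x % mods[i], computing four remainders per loop iteration. */
static void residues_unrolled4(unsigned long long x,
                               const unsigned long *mods,
                               long *res, size_t count)
{
    size_t i = 0;

    /* Main body: one loop-condition test per four remainders. */
    for (; i + 4 <= count; i += 4) {
        res[i + 0] = (long)(x % mods[i + 0]);
        res[i + 1] = (long)(x % mods[i + 1]);
        res[i + 2] = (long)(x % mods[i + 2]);
        res[i + 3] = (long)(x % mods[i + 3]);
    }

    /* Tail: the 0-3 elements left over when count is not a multiple of 4. */
    for (; i < count; i++) {
        res[i] = (long)(x % mods[i]);
    }
}
```

With the arrays from the question it would be called as `residues_unrolled4(largeNumber, modules, residues, 100);`.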

I doubt 'i < x' is the real bottleneck; Also I doubt that e.g. IA can execute multiple DIV operations in parallel -- which is the issue I tried to address. But overall, this is an excellent technique to try, as it's just so plain simple. — Aki Suihkonen, Feb 27 '14 at 11:28

score 1 · Accepted Answer · answered Feb 27 '14 at 10:13

Division by a constant (and there are only 65536 of them) can be performed by multiplication by the reciprocal, followed/preceded by some fine-tuning. Since this method is accurate only for a limited range, one can use some techniques to reduce the 64-bit operand to a much smaller value (which is still congruent to the original value):

// pseudo code -- not c
a = 0x1234567890abcdefULL;
a = (0x1234 << 48) + (0x5678 << 32) + (0x90ab << 16) + 0xcdef;

a % N === ((0x1234 * (2^48 % N)) +    // === means 'is congruent to'
           (0x5678 * (2^32 % N)) +    // ^ means exponentiation
           (0x90ab * (2^16 % N)) +
           (0xcdef * 1)) % N;

The intermediate value can be calculated with (small) multiplications only, and the final remainder (% N) has the potential to be calculated with reciprocal multiplication.
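A minimal C sketch of the reduction step, under a few assumptions of mine: the names `mod_ctx`, `mod_ctx_init` and `mod_reduce` are made up for illustration, the powers of 2^16 are precomputed once per modulus, and the final remainder is still taken with a plain `%` (this is the place where a reciprocal multiplication could be substituted). Note that the sum of the four products can reach almost 2^34, so it is accumulated in 64 bits here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-modulus table: powers of 2^16 reduced modulo n. */
typedef struct {
    uint32_t n;    /* the modulus, 1 <= n <= 65535 */
    uint32_t p16;  /* 2^16 mod n */
    uint32_t p32;  /* 2^32 mod n */
    uint32_t p48;  /* 2^48 mod n */
} mod_ctx;

/* Done once per modulus, outside the hot loop. */
static mod_ctx mod_ctx_init(uint16_t n)
{
    mod_ctx c;
    assert(n != 0);               /* a zero modulus is undefined */
    c.n   = n;
    c.p16 = 65536u % n;
    c.p32 = (c.p16 * c.p16) % n;  /* factors < 2^16, product fits in 32 bits */
    c.p48 = (c.p32 * c.p16) % n;
    return c;
}

/* Returns a % c->n using four small multiplications and one final
   remainder of a value below 2^34. */
static uint32_t mod_reduce(uint64_t a, const mod_ctx *c)
{
    uint64_t sum = (uint64_t)((a >> 48) & 0xFFFF) * c->p48
                 + (uint64_t)((a >> 32) & 0xFFFF) * c->p32
                 + (uint64_t)((a >> 16) & 0xFFFF) * c->p16
                 + (a & 0xFFFF);
    return (uint32_t)(sum % c->n);
}
```

In the question's loop one would build a `mod_ctx` per entry of `modules` at initialization time and then call `mod_reduce(largeNumber, &ctx[i])` for each `i`. Whether this is faster than the hardware 64-bit division depends on the CPU and should be benchmarked.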
