Fastest way to calculate a 128-bit integer modulo a 64-bit integer

Question

I have a 128-bit unsigned integer A and a 64-bit unsigned integer B. What's the fastest way to calculate A % B - that is the (64-bit) remainder from dividing A by B?

I'm looking to do this in either C or assembly language, but I need to target the 32-bit x86 platform. This unfortunately means that I cannot take advantage of compiler support for 128-bit integers, nor of the x64 architecture's ability to perform the required operation in a single instruction.

Edit:

Thank you for the answers so far. However, it appears to me that the suggested algorithms would be quite slow - wouldn't the fastest way to perform a 128-bit by 64-bit division be to leverage the processor's native support for 64-bit by 32-bit division? Does anyone know if there is a way to perform the larger division in terms of a few smaller divisions?

Re: How often does B change?

Primarily I'm interested in a general solution - what calculation would you perform if A and B are likely to be different every time?

However, a second possible situation is that B does not vary as often as A - there may be as many as 200 As to divide by each B. How would your answer differ in this case?

How fast must be function? How many 128 by 64 modulo operations per second do you expect? — GJ., Apr 10 '10 at 18:09
The Russian Peasant algorithm is simple but it uses loops and don't take advantage of the divide instruction in x86. You can use the algorithm [here](https://stackoverflow.com/questions/4771823/64-32-bit-division-on-a-processor-with-32-16-bit-division), it's about 64/32 bit division by 32/16 bit divide instruction but you can double it to 128/64 bit by 64/32 bit — phuclv, Jan 29 '14 at 02:57
Should answers want to test their code [this wiki answer](http://stackoverflow.com/a/39257118/2410359) is available. — chux - Reinstate Monica, Sep 22 '16 at 18:47

score 34 · Accepted Answer · edited Sep 10 '21 at 18:14

34

You can use the division version of Russian Peasant Multiplication.

To find the remainder, execute (in pseudo-code):

X = B;

while (X <= A/2)
{
    X <<= 1;
}

while (A >= B)
{
    if (A >= X)
        A -= X;
    X >>= 1;
}

The modulus is left in A.

You'll need to implement the shifts, comparisons and subtractions to operate on values made up of a pair of 64 bit numbers, but that's fairly trivial (likely you should implement the left-shift-by-1 as X + X).

This will loop at most 255 times (with a 128 bit A). Of course you need to do a pre-check for a zero divisor.

edited Sep 10 '21 at 18:14

Jérôme Richard

41,678
6
29
59

answered Apr 02 '10 at 12:15

caf

233,326
40
323
462

6

Code has bug. Interesting that it was not reported in _6_ years. Try `A=2, B=1` goes to infinite loop. `0x8711dd11 mod 0x4388ee88` fails (result s/b 1, not 0x21c47745) as well as others. Suggest `while (X < A/2)` --> `while (X <= A/2)` to repair. Your pseudo code as tested `unsigned cafMod(unsigned A, unsigned B) { assert(B); unsigned X = B; while (X < A / 2) { X <<= 1; } while (A >= B) { if (A >= X) A -= X; X >>= 1; } return A; } ` – chux - Reinstate Monica Aug 31 '16 at 18:49
2

@chux: You're absolutely right, fixed. It probably wasn't reported earlier because it only happens when A = 2ⁿ B or A = 2ⁿ B + 1. Thanks! – caf Sep 01 '16 at 01:13
1

Yup, in x86 asm implementing `x<<=1` as `add lo,lo`/`adc mid,mid`/... is more efficient than `shl lo`/`rcl mid,1`/... But in C the compiler should do that for you. Of course in x86 asm, you should actually use `bsr` (bit-scan) or `lzcnt` (leading-zero count) to find the position of the highest set bit, then use `shld hi, mid2, cl` / ... / `shl low, cl` to do all the shifting in one step instead of looping for that first `while (x <= A/2)` loop. In 32-bit mode, using SSE2 for XMM SIMD shifts with 64-bit elements is tempting, especially to reduce branching for leading-zero counts >= 32 – Peter Cordes Jul 17 '19 at 14:52

score 13 · Answer 2 · edited Jun 03 '21 at 20:34

13

If your B is small enough for the uint64_t + operation to not wrap:

Given A = AH*2^64 + AL:

A % B == (((AH % B) * (2^64 % B)) + (AL % B)) % B
      == (((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B

If your compiler supports 64-bit integers, then this is probably the easiest way to go. MSVC's implementation of a 64-bit modulo on 32-bit x86 is some hairy loop filled assembly (VC\crt\src\intel\llrem.asm for the brave), so I'd personally go with that.

edited Jun 03 '21 at 20:34

Peter Cordes

328,167
45
605
847

answered Apr 05 '10 at 03:31

MSN

53,214
7
75
105

No, as Paul sed the target is 32-bit x86 platform. Intel CPUs under IA32 doesn't support 64 bit division or 128 bit multiplication this only possible in 64 bit CPU mode. In that case the method described by caf is much faster! – GJ. Apr 05 '10 at 08:02
2

@GJ, if the compiler supports 64-bit integers, it will be easier to just use the mod operation for 64-bit integers. caf's method is the one used by MSVC anyway for 32-bit x86, based on my cursory evaluation of the assembly. It also includes an optimization for dividends below 2^32. So you could either code it yourself or just use the existing compiler support. – MSN Apr 05 '10 at 16:21
@MNS, jep you are right it will be easier, but demand is speed! Optimisation for dividends below 2^32 isn't useful if you are using random (full spectre) UInt64 because the ratio between 2^32 and 2^64 numbers is very, very small. – GJ. Apr 05 '10 at 17:03
@GJ, Yes, it's 1/2^32. If the optimal way to divide with a 64-bit dividend requires a bunch of branching anyways (which it does), adding an extra branch for < 2^32 is not going to impact performance. – MSN Apr 05 '10 at 18:56
1

This is also nice because you get a nice perf boost for free if/when you move to x86_64. – Billy ONeal Apr 06 '10 at 06:15
5

I'm not sure I understand how this works. B is 64-bit, so (AH % B) and ((2^64 - B) % B)) will both be 64-bit. Won't multiplying these together give us a 128-bit number, thus leaving us still needing to perform a 128-bit by 64-bit modulo? – user200783 Apr 10 '10 at 12:54
2

Thanks for the idea to look at how compilers implement 64-bit by 64-bit modulo on x86. From what I can tell, neither GCC (the function __udivmoddi4 in libgcc2.c) nor MSVC (see ullrem.asm for the unsigned version) use caf's "Russian Peasant" method. Instead, they both seem to use a variation on algorithm Q in the link provided by Dale Hagglund (with n=2, b=32) - approximating the 64-bit by 64-bit division using a 64-bit by 32-bit division, then performing a slight adjustment to correct the result if necessary. – user200783 Apr 10 '10 at 14:02
1

This would be a highly recursive algorithm. – drawnonward Apr 11 '10 at 00:15
2

Problem with this approach: The `*` multiplication needs a 128-bit result making the last step `some_128_bit_positive_value % some_128_bit_positive_value` and we are back where we started. Try 0x8000_0000_0000_0000_0000_0000_0000_0000 mod 0xFFFF_FFFF_FFFF_FFFE. I'd say the answer should be 2, but your algorithm gives 0, (Assuming the product of your multiplication is modulo 64-bit). This code does work for "128-bit integer modulo a 32-bit integer". Perhaps my testing is wrong, but I'd like to know the result of your testing. – chux - Reinstate Monica Aug 31 '16 at 20:01
Meant to say "making the last step some_128_bit_positive_value % some_64_bit_positive_value" – chux - Reinstate Monica Aug 31 '16 at 20:09
2

@chux: I agree the answer should be `2` for `0x80000000000000000000000000000000 % 0xFFFFFFFFFFFFFFFE`. I tested it in [`calc`, the cmdline arbitrary-precision calculator](http://www.isthe.com/chongo/tech/comp/calc/). I confirmed that truncating to 64 bits (with a bitwise AND with (2^64-1)) breaks the formula, so it does essentially leave you at square 1. `(((AH % B) * ((2^64 - B) % B))&(2^64-1) + (AL % B))&(2^64-1) % B == 0` but `(((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B == 2`. I used `AH=A>>64` and `AL=0`. – Peter Cordes Sep 01 '16 at 09:01

score 13 · Answer 3 · answered Apr 06 '10 at 06:49

13

Perhaps you're looking for a finished program, but the basic algorithms for multi-precision arithmetic can be found in Knuth's Art of Computer Programming, Volume 2. You can find the division algorithm described online here. The algorithms deal with arbitrary multi-precision arithmetic, and so are more general than you need, but you should be able to simplify them for 128 bit arithmetic done on 64- or 32-bit digits. Be prepared for a reasonable amount of work (a) understanding the algorithm, and (b) converting it to C or assembler.

You might also want to check out Hacker's Delight, which is full of very clever assembler and other low-level hackery, including some multi-precision arithmetic.

answered Apr 06 '10 at 06:49

Dale Hagglund

16,074
4
30
37

1

Thanks, I think I understand how the algorithms described at sputsoft.com apply to this situation. AFAICT, Algorithm G shows how to perform an mb-bit by nb-bit division as a series of m-n+1 (n+1)b-bit by nb-bit divisions, where b is the number of bits per digit. Algorithm Q then shows how to perform each of these (n+1)b-bit by nb-bit divisions as a single 2b-bit by b-bit division. Given that the largest dividend we can handle is 64-bit, we need to set b=32. The algorithms thus break down our 128-bit by 64-bit division (m=4, n=2) into 3 64-bit by 32-bit divisions. Does this sound accurate? – user200783 Apr 10 '10 at 12:40
I can tell you've already put more detailed thought into the algorithms than I did when I posted my reply, so I can't say for sure whether your final count of division operations is right. However, I do think you've got the basic idea of how to proceed. – Dale Hagglund Apr 10 '10 at 14:36
Another thought: you might want to consider 16-bit digits if you're writing in C and hence don't have direct access to 32b x 32b -> 64b multiply instructions, or don't want to embed your 32-bit digits into a 64-bit integer and use the compiler's own builtin 64-bit arithmetic. I can't think of a strong reason to avoid the latter, but you might want to check out the generated assembly code for it, if you're really, really, really concerned about speed. – Dale Hagglund Apr 10 '10 at 14:42
That sputsoft link seems to be invalid now. Not sure why—the site is still there. [This page](http://kanooth.com/blog/tag/numbers-project+basic-theory) seems to be connected, in that the [`kanooth-numbers`](https://github.com/janmarthedal/kanooth-numbers) library was once called `sputsoftnumbers`. – Craig McQueen Sep 04 '13 at 04:50
1

The sputsoft page is now located here: https://janmr.com/blog/2009/08/implementing-multiple-precision-arithmetic-part-2/ – Cheng Sun Jul 30 '17 at 20:53

GJ. · Answer 4 · 2010-04-06T22:37:56.687

This is almost untested partly speed modificated Mod128by64 'Russian peasant' algorithm function. Unfortunately I'm a Delphi user so this function works under Delphi. :) But the assembler is almost the same so...

function Mod128by64(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = @Dividend
//   : edx = @Divisor
//Out: eax:edx as Remainder
asm
//Registers inside rutine
//Divisor = edx:ebp
//Dividend = bh:ebx:edx //We need 64 bits + 1 bit in bh
//Result = esi:edi
//ecx = Loop counter and Dividend index
  push    ebx                     //Store registers to stack
  push    esi
  push    edi
  push    ebp
  mov     ebp, [edx]              //Divisor = edx:ebp
  mov     edx, [edx + 4]
  mov     ecx, ebp                //Div by 0 test
  or      ecx, edx                
  jz      @DivByZero
  xor     edi, edi                //Clear result
  xor     esi, esi
//Start of 64 bit division Loop
  mov     ecx, 15                 //Load byte loop shift counter and Dividend index
@SkipShift8Bits:                  //Small Dividend numbers shift optimisation
  cmp     [eax + ecx], ch         //Zero test
  jnz     @EndSkipShiftDividend
  loop    @SkipShift8Bits         //Skip 8 bit loop
@EndSkipShiftDividend:
  test    edx, $FF000000          //Huge Divisor Numbers Shift Optimisation
  jz      @Shift8Bits             //This Divisor is > $00FFFFFF:FFFFFFFF
  mov     ecx, 8                  //Load byte shift counter
  mov     esi, [eax + 12]         //Do fast 56 bit (7 bytes) shift...
  shr     esi, cl                 //esi = $00XXXXXX
  mov     edi, [eax + 9]          //Load for one byte right shifted 32 bit value
@Shift8Bits:
  mov     bl, [eax + ecx]         //Load 8 bits of Dividend
//Here we can unrole partial loop 8 bit division to increase execution speed...
  mov     ch, 8                   //Set partial byte counter value
@Do65BitsShift:
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  setc    bh                      //Save 65th bit
  sub     edi, ebp                //Compare dividend and  divisor
  sbb     esi, edx                //Subtract the divisor
  sbb     bh, 0                   //Use 65th bit in bh
  jnc     @NoCarryAtCmp           //Test...
  add     edi, ebp                //Return privius dividend state
  adc     esi, edx
@NoCarryAtCmp:
  dec     ch                      //Decrement counter
  jnz     @Do65BitsShift
//End of 8 bit (byte) partial division loop
  dec     cl                      //Decrement byte loop shift counter
  jns     @Shift8Bits             //Last jump at cl = 0!!!
//End of 64 bit division loop
  mov     eax, edi                //Load result to eax:edx
  mov     edx, esi
@RestoreRegisters:
  pop     ebp                     //Restore Registers
  pop     edi
  pop     esi
  pop     ebx
  ret
@DivByZero:
  xor     eax, eax                //Here you can raise Div by 0 exception, now function only return 0.
  xor     edx, edx
  jmp     @RestoreRegisters
end;

At least one more speed optimisation is possible! After 'Huge Divisor Numbers Shift Optimisation' we can test divisors high bit, if it is 0 we do not need to use extra bh register as 65th bit to store in it. So unrolled part of loop can look like:

  shl     bl,1                    //Shift dividend left for one bit
  rcl     edi,1
  rcl     esi,1
  sub     edi, ebp                //Compare dividend and  divisor
  sbb     esi, edx                //Subtract the divisor
  jnc     @NoCarryAtCmpX
  add     edi, ebp                //Return privius dividend state
  adc     esi, edx
@NoCarryAtCmpX:

Peter Cordes · Answer 5 · 2020-02-12T03:52:00.230

I know the question specified 32-bit code, but the answer for 64-bit may be useful or interesting to others.

And yes, 64b/32b => 32b division does make a useful building-block for 128b % 64b => 64b. libgcc's __umoddi3 (source linked below) gives an idea of how to do that sort of thing, but it only implements 2N % 2N => 2N on top of a 2N / N => N division, not 4N % 2N => 2N.

Wider multi-precision libraries are available, e.g. https://gmplib.org/manual/Integer-Division.html#Integer-Division.

GNU C on 64-bit machines does provide an __int128 type, and libgcc functions to multiply and divide as efficiently as possible on the target architecture.

x86-64's div r/m64 instruction does 128b/64b => 64b division (also producing remainder as a second output), but it faults if the quotient overflows. So you can't directly use it if A/B > 2^64-1, but you can get gcc to use it for you (or even inline the same code that libgcc uses).

This compiles (Godbolt compiler explorer) to one or two div instructions (which happen inside a libgcc function call). If there was a faster way, libgcc would probably use that instead.

#include <stdint.h>
uint64_t AmodB(unsigned __int128 A, uint64_t B) {
  return A % B;
}

The __umodti3 function it calls calculates a full 128b/128b modulo, but the implementation of that function does check for the special case where the divisor's high half is 0, as you can see in the libgcc source. (libgcc builds the si/di/ti version of the function from that code, as appropriate for the target architecture. udiv_qrnnd is an inline asm macro that does unsigned 2N/N => N division for the target architecture.

For x86-64 (and other architectures with a hardware divide instruction), the fast-path (when high_half(A) < B; guaranteeing div won't fault) is just two not-taken branches, some fluff for out-of-order CPUs to chew through, and a single div r64 instruction, which takes about 50-100 cycles¹ on modern x86 CPUs, according to Agner Fog's insn tables. Some other work can be happening in parallel with div, but the integer divide unit is not very pipelined and div decodes to a lot of uops (unlike FP division).

The fallback path still only uses two 64-bit div instructions for the case where B is only 64-bit, but A/B doesn't fit in 64 bits so A/B directly would fault.

Note that libgcc's __umodti3 just inlines __udivmoddi4 into a wrapper that only returns the remainder.

Footnote 1: 32-bit div is over 2x faster on Intel CPUs. On AMD CPUs, performance only depends on the size of the actual input values, even if they're small values in a 64-bit register. If small values are common, it might be worth benchmarking a branch to a simple 32-bit division version before doing 64-bit or 128-bit division.

For repeated modulo by the same `B`

It might be worth considering calculating a fixed-point multiplicative inverse for B, if one exists. For example, with compile-time constants, gcc does the optimization for types narrower than 128b.

uint64_t modulo_by_constant64(uint64_t A) { return A % 0x12345678ABULL; }

    movabs  rdx, -2233785418547900415
    mov     rax, rdi
    mul     rdx
    mov     rax, rdx             # wasted instruction, could have kept using RDX.
    movabs  rdx, 78187493547
    shr     rax, 36            # division result
    imul    rax, rdx           # multiply and subtract to get the modulo
    sub     rdi, rax
    mov     rax, rdi
    ret

x86's mul r64 instruction does 64b*64b => 128b (rdx:rax) multiplication, and can be used as a building block to construct a 128b * 128b => 256b multiply to implement the same algorithm. Since we only need the high half of the full 256b result, that saves a few multiplies.

Modern Intel CPUs have very high performance mul: 3c latency, one per clock throughput. However, the exact combination of shifts and adds required varies with the constant, so the general case of calculating a multiplicative inverse at run-time isn't quite as efficient each time its used as a JIT-compiled or statically-compiled version (even on top of the pre-computation overhead).

IDK where the break-even point would be. For JIT-compiling, it will be higher than ~200 reuses, unless you cache generated code for commonly-used B values. For the "normal" way, it might possibly be in the range of 200 reuses, but IDK how expensive it would be to find a modular multiplicative inverse for 128-bit / 64-bit division.

libdivide can do this for you, but only for 32 and 64-bit types. Still, it's probably a good starting point.

Accipitridae · Answer 6 · 2010-04-12T18:44:49.590

4

The solution depends on what exactly you are trying to solve.

E.g. if you are doing arithmetic in a ring modulo a 64-bit integer then using Montgomerys reduction is very efficient. Of course this assumes that you the same modulus many times and that it pays off to convert the elements of the ring into a special representation.

To give just a very rough estimate on the speed of this Montgomerys reduction: I have an old benchmark that performs a modular exponentiation with 64-bit modulus and exponent in 1600 ns on a 2.4Ghz Core 2. This exponentiation does about 96 modular multiplications (and modular reductions) and hence needs about 40 cycles per modular multiplication.

edited Apr 12 '10 at 18:44

answered Apr 06 '10 at 21:40

Accipitridae

3,136
19
9

1

The wikipedia article describes using Montgomery reduction to increase the efficiency of modular multiplication (and, by extension, modular exponentiation). Do you know if the technique still applies in a situation where there are a large number of modular additions as well as multiplications? – user200783 Apr 10 '10 at 10:48
1

Addition is done as usual. If both summands are in Montgomery representation then adding them together gives their sum in Montgomery representation. If this sum is larger than the modulus, just subtract the modulus. – Accipitridae Apr 12 '10 at 18:33

GJ. · Answer 7 · 2010-04-10T17:50:12.093

I have made both version of Mod128by64 'Russian peasant' division function: classic and speed optimised. Speed optimised can do on my 3Ghz PC more than 1000.000 random calculations per second and is more than three times faster than classic function. If we compare the execution time of calculating 128 by 64 and calculating 64 by 64 bit modulo than this function is only about 50% slower.

Classic Russian peasant:

function Mod128by64Clasic(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = @Dividend
//   : edx = @Divisor
//Out: eax:edx as Remainder
asm
//Registers inside rutine
//edx:ebp = Divisor
//ecx = Loop counter
//Result = esi:edi
  push    ebx                     //Store registers to stack
  push    esi
  push    edi
  push    ebp
  mov     ebp, [edx]              //Load  divisor to edx:ebp
  mov     edx, [edx + 4]
  mov     ecx, ebp                //Div by 0 test
  or      ecx, edx
  jz      @DivByZero
  push    [eax]                   //Store Divisor to the stack
  push    [eax + 4]
  push    [eax + 8]
  push    [eax + 12]
  xor     edi, edi                //Clear result
  xor     esi, esi
  mov     ecx, 128                //Load shift counter
@Do128BitsShift:
  shl     [esp + 12], 1           //Shift dividend from stack left for one bit
  rcl     [esp + 8], 1
  rcl     [esp + 4], 1
  rcl     [esp], 1
  rcl     edi, 1
  rcl     esi, 1
  setc    bh                      //Save 65th bit
  sub     edi, ebp                //Compare dividend and  divisor
  sbb     esi, edx                //Subtract the divisor
  sbb     bh, 0                   //Use 65th bit in bh
  jnc     @NoCarryAtCmp           //Test...
  add     edi, ebp                //Return privius dividend state
  adc     esi, edx
@NoCarryAtCmp:
  loop    @Do128BitsShift
//End of 128 bit division loop
  mov     eax, edi                //Load result to eax:edx
  mov     edx, esi
@RestoreRegisters:
  lea     esp, esp + 16           //Restore Divisors space on stack
  pop     ebp                     //Restore Registers
  pop     edi                     
  pop     esi
  pop     ebx
  ret
@DivByZero:
  xor     eax, eax                //Here you can raise Div by 0 exception, now function only return 0.
  xor     edx, edx
  jmp     @RestoreRegisters
end;

Speed optimised Russian peasant:

function Mod128by64Oprimized(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = @Dividend
//   : edx = @Divisor
//Out: eax:edx as Remainder
asm
//Registers inside rutine
//Divisor = edx:ebp
//Dividend = ebx:edx //We need 64 bits
//Result = esi:edi
//ecx = Loop counter and Dividend index
  push    ebx                     //Store registers to stack
  push    esi
  push    edi
  push    ebp
  mov     ebp, [edx]              //Divisor = edx:ebp
  mov     edx, [edx + 4]
  mov     ecx, ebp                //Div by 0 test
  or      ecx, edx
  jz      @DivByZero
  xor     edi, edi                //Clear result
  xor     esi, esi
//Start of 64 bit division Loop
  mov     ecx, 15                 //Load byte loop shift counter and Dividend index
@SkipShift8Bits:                  //Small Dividend numbers shift optimisation
  cmp     [eax + ecx], ch         //Zero test
  jnz     @EndSkipShiftDividend
  loop    @SkipShift8Bits         //Skip Compute 8 Bits unroled loop ?
@EndSkipShiftDividend:
  test    edx, $FF000000          //Huge Divisor Numbers Shift Optimisation
  jz      @Shift8Bits             //This Divisor is > $00FFFFFF:FFFFFFFF
  mov     ecx, 8                  //Load byte shift counter
  mov     esi, [eax + 12]         //Do fast 56 bit (7 bytes) shift...
  shr     esi, cl                 //esi = $00XXXXXX
  mov     edi, [eax + 9]          //Load for one byte right shifted 32 bit value
@Shift8Bits:
  mov     bl, [eax + ecx]         //Load 8 bit part of Dividend
//Compute 8 Bits unroled loop
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  jc      @DividentAbove0         //dividend hi bit set?
  cmp     esi, edx                //dividend hi part larger?
  jb      @DividentBelow0
  ja      @DividentAbove0
  cmp     edi, ebp                //dividend lo part larger?
  jb      @DividentBelow0
@DividentAbove0:
  sub     edi, ebp                //Return privius dividend state
  sbb     esi, edx
@DividentBelow0:
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  jc      @DividentAbove1         //dividend hi bit set?
  cmp     esi, edx                //dividend hi part larger?
  jb      @DividentBelow1
  ja      @DividentAbove1
  cmp     edi, ebp                //dividend lo part larger?
  jb      @DividentBelow1
@DividentAbove1:
  sub     edi, ebp                //Return privius dividend state
  sbb     esi, edx
@DividentBelow1:
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  jc      @DividentAbove2         //dividend hi bit set?
  cmp     esi, edx                //dividend hi part larger?
  jb      @DividentBelow2
  ja      @DividentAbove2
  cmp     edi, ebp                //dividend lo part larger?
  jb      @DividentBelow2
@DividentAbove2:
  sub     edi, ebp                //Return privius dividend state
  sbb     esi, edx
@DividentBelow2:
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  jc      @DividentAbove3         //dividend hi bit set?
  cmp     esi, edx                //dividend hi part larger?
  jb      @DividentBelow3
  ja      @DividentAbove3
  cmp     edi, ebp                //dividend lo part larger?
  jb      @DividentBelow3
@DividentAbove3:
  sub     edi, ebp                //Return privius dividend state
  sbb     esi, edx
@DividentBelow3:
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  jc      @DividentAbove4         //dividend hi bit set?
  cmp     esi, edx                //dividend hi part larger?
  jb      @DividentBelow4
  ja      @DividentAbove4
  cmp     edi, ebp                //dividend lo part larger?
  jb      @DividentBelow4
@DividentAbove4:
  sub     edi, ebp                //Return privius dividend state
  sbb     esi, edx
@DividentBelow4:
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  jc      @DividentAbove5         //dividend hi bit set?
  cmp     esi, edx                //dividend hi part larger?
  jb      @DividentBelow5
  ja      @DividentAbove5
  cmp     edi, ebp                //dividend lo part larger?
  jb      @DividentBelow5
@DividentAbove5:
  sub     edi, ebp                //Return privius dividend state
  sbb     esi, edx
@DividentBelow5:
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  jc      @DividentAbove6         //dividend hi bit set?
  cmp     esi, edx                //dividend hi part larger?
  jb      @DividentBelow6
  ja      @DividentAbove6
  cmp     edi, ebp                //dividend lo part larger?
  jb      @DividentBelow6
@DividentAbove6:
  sub     edi, ebp                //Return privius dividend state
  sbb     esi, edx
@DividentBelow6:
  shl     bl, 1                   //Shift dividend left for one bit
  rcl     edi, 1
  rcl     esi, 1
  jc      @DividentAbove7         //dividend hi bit set?
  cmp     esi, edx                //dividend hi part larger?
  jb      @DividentBelow7
  ja      @DividentAbove7
  cmp     edi, ebp                //dividend lo part larger?
  jb      @DividentBelow7
@DividentAbove7:
  sub     edi, ebp                //Return privius dividend state
  sbb     esi, edx
@DividentBelow7:
//End of Compute 8 Bits (unroled loop)
  dec     cl                      //Decrement byte loop shift counter
  jns     @Shift8Bits             //Last jump at cl = 0!!!
//End of division loop
  mov     eax, edi                //Load result to eax:edx
  mov     edx, esi
@RestoreRegisters:
  pop     ebp                     //Restore Registers
  pop     edi
  pop     esi
  pop     ebx
  ret
@DivByZero:
  xor     eax, eax                //Here you can raise Div by 0 exception, now function only return 0.
  xor     edx, edx
  jmp     @RestoreRegisters
end;

On modern Intel CPUs, `rcl reg,1` is 3 uops, but `adc reg,reg` reads and writes CF and ZF identically for only 1 uop since Broadwell, or 2 uops on Haswell and earlier. Similarly, `shl bl,1` could be `add bl,bl`. The only advantage there is running on more ports (not the shifter port(s)), which might not be a bottleneck. (`add same,same` is of course a left-shift because `x*2 = x+x`, putting the carry-out in CF. `adc same,same` does that and also adds the input CF, setting the low bit just like RCL.) AMD has fast `rcl`-by-1, though. http://agner.org/optimize/ — Peter Cordes, Apr 15 '18 at 04:41

score 4 · Answer 8 · answered Apr 10 '10 at 23:00

I'd like to share a few thoughts.

It's not as simple as MSN proposes I'm afraid.

In the expression:

(((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B

both multiplication and addition may overflow. I think one could take it into account and still use the general concept with some modifications, but something tells me it's going to get really scary.

I was curious how 64 bit modulo operation was implemented in MSVC and I tried to find something out. I don't really know assembly and all I had available was Express edition, without the source of VC\crt\src\intel\llrem.asm, but I think I managed to get some idea what's going on, after a bit of playing with the debugger and disassembly output. I tried to figure out how the remainder is calculated in case of positive integers and the divisor >=2^32. There is some code that deals with negative numbers of course, but I didn't dig into that.

Here is how I see it:

If divisor >= 2^32 both the dividend and the divisor are shifted right as much as necessary to fit the divisor into 32 bits. In other words: if it takes n digits to write the divisor down in binary and n > 32, n-32 least significant digits of both the divisor and the dividend are discarded. After that, the division is performed using hardware support for dividing 64 bit integers by 32 bit ones. The result might be incorrect, but I think it can be proved, that the result may be off by at most 1. After the division, the divisor (original one) is multiplied by the result and the product subtracted from the dividend. Then it is corrected by adding or subtracting the divisor if necessary (if the result of the division was off by one).

It's easy to divide 128 bit integer by 32 bit one leveraging hardware support for 64-bit by 32-bit division. In case the divisor < 2^32, one can calculate the remainder performing just 4 divisions as follows:

Let's assume the dividend is stored in:

DWORD dividend[4] = ...

the remainder will go into:

DWORD remainder;

1) Divide dividend[3] by divisor. Store the remainder in remainder.
2) Divide QWORD (remainder:dividend[2]) by divisor. Store the remainder in remainder.
3) Divide QWORD (remainder:dividend[1]) by divisor. Store the remainder in remainder.
4) Divide QWORD (remainder:dividend[0]) by divisor. Store the remainder in remainder.

After those 4 steps the variable remainder will hold what You are looking for. (Please don't kill me if I got the endianess wrong. I'm not even a programmer)

In case the divisor is grater than 2^32-1 I don't have good news. I don't have a complete proof that the result after the shift is off by no more than 1, in the procedure I described earlier, which I believe MSVC is using. I think however that it has something to do with the fact, that the part that is discarded is at least 2^31 times less than the divisor, the dividend is less than 2^64 and the divisor is greater than 2^32-1, so the result is less than 2^32.

If the dividend has 128 bits the trick with discarding bits won't work. So in general case the best solution is probably the one proposed by GJ or caf. (Well, it would be probably the best even if discarding bits worked. Division, multiplication subtraction and correction on 128 bit integer might be slower.)

I was also thinking about using the floating point hardware. x87 floating point unit uses 80 bit precision format with fraction 64 bits long. I think one can get the exact result of 64 bit by 64 bit division. (Not the remainder directly, but also the remainder using multiplication and subtraction like in the "MSVC procedure"). IF the dividend >=2^64 and < 2^128 storing it in the floating point format seems similar to discarding least significant bits in "MSVC procedure". Maybe someone can prove the error in that case is bound and find it useful. I have no idea if it has a chance to be faster than GJ's solution, but maybe it's worth it to try.

I think your thinking is more or less correct. Yes the idea about using x87 double-precision floating point division is also known, but the x87 only support 63bit division because the 64th bit is reserved for mantissa sign according: IEEE Standard 754 for Binary Floating-Point Arithmetic. — GJ., Apr 11 '10 at 07:03
I was talking about the Double-Extended format supported by x87. In double format the fraction is only 53 bits long. In the extended one the fraction or rather the significand is 64 bits long. There is a difference between this format and the smaller ones. In extended format the leading bit of the significand is explicit unlike in double or single ones, but I don't think it changes much. It should be possible to store exactly 64 bit integers in this format. The sign is stored in bit 79 in extended format. — Maciej Hehl, Apr 11 '10 at 10:55
I have check the IEEE Standard and you are right. The mantisa sign is stored in last byte. — GJ., Apr 11 '10 at 16:13
What you describe is the so called base case division as described by Knuth in his algorithm D (TAOCP Vol. 2). It relies on the fact that if you divide the top two "digits" of the dividend by the top digit of the divisor, the result is off by at most 2. You test this by subtracting the result * divisor from the dividend/remainder and see if it is negative. If so, you add the divisor and correct the quotient until the remainder is positive again. Then you loop for the next lower digit etc. — Rudy Velthuis, Mar 14 '13 at 12:30
Agree `(((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B` has [problems](http://stackoverflow.com/questions/2566010/fastest-way-to-calculate-a-128-bit-integer-modulo-a-64-bit-integer/39257118#comment65852083_2576795) — chux - Reinstate Monica, Aug 31 '16 at 20:03

score 4 · Answer 9 · answered Aug 31 '16 at 19:00

The accepted answer by @caf was real nice and highly rated, yet it contain a bug not seen for years.

To help test that and other solutions, I am posting a test harness and making it community wiki.

unsigned cafMod(unsigned A, unsigned B) {
  assert(B);
  unsigned X = B;
  // while (X < A / 2) {  Original code used <
  while (X <= A / 2) {
    X <<= 1;
  }
  while (A >= B) {
    if (A >= X) A -= X;
    X >>= 1;
  }
  return A;
}

void cafMod_test(unsigned num, unsigned den) {
  if (den == 0) return;
  unsigned y0 = num % den;
  unsigned y1 = mod(num, den);
  if (y0 != y1) {
    printf("FAIL num:%x den:%x %x %x\n", num, den, y0, y1);
    fflush(stdout);
    exit(-1);
  }
}

unsigned rand_unsigned() {
  unsigned x = (unsigned) rand();
  return x * 2 ^ (unsigned) rand();
}

void cafMod_tests(void) {
  const unsigned i[] = { 0, 1, 2, 3, 0x7FFFFFFF, 0x80000000, 
      UINT_MAX - 3, UINT_MAX - 2, UINT_MAX - 1, UINT_MAX };
  for (unsigned den = 0; den < sizeof i / sizeof i[0]; den++) {
    if (i[den] == 0) continue;
    for (unsigned num = 0; num < sizeof i / sizeof i[0]; num++) {
      cafMod_test(i[num], i[den]);
    }
  }
  cafMod_test(0x8711dd11, 0x4388ee88);
  cafMod_test(0xf64835a1, 0xf64835a);

  time_t t;
  time(&t);
  srand((unsigned) t);
  printf("%u\n", (unsigned) t);fflush(stdout);
  for (long long n = 10000LL * 1000LL * 1000LL; n > 0; n--) {
    cafMod_test(rand_unsigned(), rand_unsigned());
  }

  puts("Done");
}

int main(void) {
  cafMod_tests();
  return 0;
}

score 2 · Answer 10 · answered Apr 08 '10 at 11:03

As a general rule, division is slow and multiplication is faster, and bit shifting is faster yet. From what I have seen of the answers so far, most of the answers have been using a brute force approach using bit-shifts. There exists another way. Whether it is faster remains to be seen (AKA profile it).

Instead of dividing, multiply by the reciprocal. Thus, to discover A % B, first calculate the reciprocal of B ... 1/B. This can be done with a few loops using the Newton-Raphson method of convergence. To do this well will depend upon a good set of initial values in a table.

For more details on the Newton-Raphson method of converging on the reciprocal, please refer to http://en.wikipedia.org/wiki/Division_(digital)

Once you have the reciprocal, the quotient Q = A * 1/B.

The remainder R = A - Q*B.

To determine if this would be faster than the brute force (as there will be many more multiplies since we will be using 32-bit registers to simulate 64-bit and 128-bit numbers, profile it.

If B is constant in your code, you can pre-calculate the reciprocal and simply calculate using the last two formulae. This, I am sure will be faster than bit-shifting.

Hope this helps.

Another approach which may sometimes be even better if e.g. the divisor is 2^64-k for some relatively small k, and the dividend is less than 2^128/k, is to add k to the input value, capture and zero the top 64 bits of the dividend, multiply the captured value by k (for a 96-bit or 128-bit result), and add that to the lower 64 bits of the dividend. If the result is greater than 2^64, repeat. Once the result is less than 2^64, subtract k. For values of k below 2^32 (half the divisor size), two capture-zero-multiply-subtract sequences should suffice. — supercat, May 16 '13 at 18:34
The question is about integer calculations. What if `1/B` (or in integer form, `2^64/B` or `2^128/B`) doesn't have an exact integer representation? — Craig McQueen, Sep 02 '13 at 23:38

CookieNinja · Answer 11 · 2016-11-03T15:27:46.880

If 128-bit unsigned by 63-bit unsigned is good enough, then it can be done in a loop doing at most 63 cycles.

Consider this a proposed solution MSNs' overflow problem by limiting it to 1-bit. We do so by splitting the problem in 2, modular multiplication and adding the results at the end.

In the following example upper corresponds to the most significant 64-bits, lower to the least significant 64-bits and div is the divisor.

unsigned 128_mod(uint64_t upper, uint64_t lower, uint64_t div) {
  uint64_t result = 0;
  uint64_t a = (~0%div)+1;
  upper %= div; // the resulting bit-length determines number of cycles required

  // first we work out modular multiplication of (2^64*upper)%div
  while (upper != 0){
    if(upper&1 == 1){
      result += a;
      if(result >= div){result -= div;}
    }
    a <<= 1;
    if(a >= div){a -= div;}
    upper >>= 1;
  }

  // add up the 2 results and return the modulus
  if(lower>div){lower -= div;}
  return (lower+result)%div;
}

The only problem is that, if the divisor is 64-bits then we get overflows of 1-bit (loss of information) giving a faulty result.

It bugs me that I haven't figured out a neat way to handle the overflows.

HermannSW · Answer 12 · 2021-06-04T14:10:11.437

1

I don't know how to compile the assembler codes, any help is appreciated to compile and test them.

I solved this problem by comparing against gmplib "mpz_mod()" and summing 1 million loop results. It was a long ride to go from slowdown (seedup 0.12) to speedup 1.54 -- that is the reason I think the C codes in this thread will be slow.

Details inclusive test harness in this thread:
https://www.raspberrypi.org/forums/viewtopic.php?f=33&t=311893&p=1873122#p1873122

This is "mod_256()" with speedup over using gmplib "mpz_mod()", use of __builtin_clzll() for longer shifts was essential:

typedef __uint128_t uint256_t[2];

#define min(x, y) ((x<y) ? (x) : (y))

int clz(__uint128_t u)
{
//  unsigned long long h = ((unsigned long long *)&u)[1];
  unsigned long long h = u >> 64;
  return (h!=0) ? __builtin_clzll(h) : 64 + __builtin_clzll(u);
}

__uint128_t mod_256(uint256_t x, __uint128_t n)
{
  if (x[1] == 0)  return x[0] % n;
  else
  {
    __uint128_t r = x[1] % n;
    int F = clz(n);
    int R = clz(r);
    for(int i=0; i<128; ++i)
    {
      if (R>F+1)
      {
        int h = min(R-(F+1), 128-i);
        r <<= h; R-=h; i+=(h-1); continue;
      }
      r <<= 1; if (r >= n)  { r -= n; R=clz(r); }
    }
    r += (x[0] % n); if (r >= n)  r -= n;

    return r;
  }
}

edited Jun 04 '21 at 14:10

answered Jun 03 '21 at 18:15

HermannSW

161
1
8

`((unsigned long long *)&u)[1];` isn't safe unless you compile with `-fno-strict-aliasing`. Use `u>>64` GNU C compilers that support `unsigned __int128` in the first place like GCC and clang will do a good job with it. – Peter Cordes Jun 03 '21 at 20:54
Both statements get compiled to exactly same assembler instruction: https://godbolt.org/z/vzG38h9ha – HermannSW Jun 04 '21 at 13:59
Exactly. So pick the one that's guaranteed not to break with different surrounding code, is more readable, and isn't endian-dependent (e.g. on MIPS64 / PowerPC64 are often big-endian). `u>>64`. The whole point of *undefined* behaviour is that it's not *guaranteed* to break in every case, just that it can. Showing a case where it happens to work proves nothing. – Peter Cordes Jun 04 '21 at 14:01
OK, I buy that, and changed statement in similar function: https://gist.github.com/Hermann-SW/6fe3f137e5aac288c026859452638294#file-gcd-c-L23 In same function, is assignment to "h" for accessing low 64bit safe? inline int ctz(__uint128_t u) { unsigned long long h = u; ... – HermannSW Jun 04 '21 at 14:06
Yes, assignment to an unsigned type, from an *integral* type whose value is too large to fit, is guaranteed to do modulo reduction by the type-max to make the value fit. (i.e. truncate on normal systems where the max is a power of 2). Fun fact: that only happens when assigning from integral types; it's UB for huge floating-point values. And there's of course no strict-aliasing UB because pointers aren't involved anymore. – Peter Cordes Jun 04 '21 at 14:23

score 0 · Answer 13 · answered Apr 02 '10 at 20:08

0

If you have a recent x86 machine, there are 128-bit registers for SSE2+. I've never tried to write assembly for anything other than basic x86, but I suspect there are some guides out there.

answered Apr 02 '10 at 20:08

Adam Shiemke

3,734
2
22
23

8

The `xmm` registers are not useful for this type of operation, as they aren't true 128-bit GPRs; they're a bunch of smaller registers packed together for vectorized operations. – kquinn Apr 02 '10 at 20:34
1

there are 128-bit integer instructions in SSE2. as far as I can tell from the reference manuals, there's no reason they wouldn't be useful for this. There's a multiply, add/subtract, and shift. – Ben Collins Apr 07 '10 at 12:27
@Ben: In my (brief) look through the Intel manuals I was unable to find a 128-bit integer addition instruction. Do you know what this instruction is called? – user200783 Apr 10 '10 at 10:41
@Paul: PMULUDQ, PADDQ, PSUBQ, PSLLDQ, PSRLDQ. There's an overview listing on v 1, p. 5-25 of the software developer's manual. – Ben Collins Apr 10 '10 at 14:43
4

I have looked at those instructions in volume 2 of the Software Developer's Manual and it seems to me that only PSLLDQ and PSRLDQ treat an xmm register as a 128-bit integer. PADDQ and PSUBQ, by contrast, seem to treat an xmm register as "packed quadwords" (i.e. a pair of 64-bit integers). Is this not correct? – user200783 Apr 10 '10 at 15:18
1

@BenCollins SIMD registers are for operating on **multiple values at once**. You can't use it as a single 128-bit value. See [What are the 128-bit to 512-bit registers used for?](https://stackoverflow.com/q/52932539/995714), [Is it possible to use SSE and SSE2 to make a 128-bit wide integer?](https://stackoverflow.com/q/12200698/995714) – phuclv Aug 11 '20 at 08:08
@phuclv I recognize and understand what SIMD is for. My point was that there are limited circumstances in which you can use the registers this way, and the questions you linked actually point out that there are certain limited circumstances in which you _can_ actually use SIMD registers for wide integers. – Ben Collins Aug 18 '20 at 14:53

Calimero · Answer 14 · 2021-08-26T15:45:03.167

0

I am 9 years after the battle but here is an interesting O(1) edge case for powers of 2 that is worth mentioning.

#include <stdio.h>
// example with 32 bits and 8 bits.
int main() {
    int i = 930;
    unsigned char b = (unsigned char) i;
    printf("%d", (int) b); // 162, same as 930 % 256
}

edited Aug 26 '21 at 15:45

answered Aug 26 '21 at 15:38

Calimero

161
1
8

The question is about divisors that *fit* in a 64-bit integer. `256` doesn't fit in 8 bits, so this isn't an example of 32-bit % 8-bit. But yes, anything `% 256` is equivalent to `& 0xFF`, just taking the low byte, that's a well-known fact for divisors that are powers of 2 when working with binary numbers. (i.e. integers in computers.) – Peter Cordes Aug 26 '21 at 17:00

score -2 · Answer 15 · answered Apr 02 '10 at 11:04

-2

Since there is no predefined 128-bit integer type in C, bits of A have to be represented in an array. Although B (64-bit integer) can be stored in an unsigned long long int variable, it is needed to put bits of B into another array in order to work on A and B efficiently.

After that, B is incremented as Bx2, Bx3, Bx4, ... until it is the greatest B less than A. And then (A-B) can be calculated, using some subtraction knowledge for base 2.

Is this the kind of solution that you are looking for?

answered Apr 02 '10 at 11:04

Ahmet Altun

175
1
11

1

That doesn't sound very efficient. It has the potential of taking O(2^128), if B is small and A is large. – Avi Apr 02 '10 at 11:25
The complexity of algorithm can be reduced by incrementing B using left shifting of bytes. It means multiplication by 2 each time. When the B is greater than A, starting from the previous value of B, B can be incremented by initial value of B each time and so on... – Ahmet Altun Apr 02 '10 at 11:35

Fastest way to calculate a 128-bit integer modulo a 64-bit integer

15 Answers15

For repeated modulo by the same `B`

Linked

Related

Fastest way to calculate a 128-bit integer modulo a 64-bit integer

15 Answers15

For repeated modulo by the same B

Linked

Related

For repeated modulo by the same `B`