Karatsuba multiplication algorithm with modular reduction

Question

I am trying to estimate the time it takes to multiply two large bit numbers modulo some large prime. My computational power is limited to adding, multiplying, and storing 32-bit numbers.

Due to this I wanted to use the Karatsuba algorithm since the multiplication is reduced to single-digit operations. However, the algorithm does not incorporate modular reduction.

My other thought would be to use Montgomery reduction but I'm not sure how this would perform given my computational limitations.

Which algorithm would you suggest I use?

karatsuba is not single digit! instead it use more digit sizes ... from 1 up to ~solution_digits/2 so you need variable bit size arithmetics. If you will have only very big numbers and do not want to implement variable bit wide arithmetics then I would use Schönhage-Strassen multiplication based on NTT transform http://stackoverflow.com/q/18577076/2521214 for small numbers use naive O(n^2) multiplication — Spektre, Mar 08 '15 at 09:36
@Spektre According to the Wikipedia page for the Karatsuba algorithm, it reduces the multiplication of two n-digit numbers to at most n^1.585 single-digit multiplications in general. Is this not correct? — Dr. Van Nostrand, Mar 09 '15 at 14:51
yes that is probably true, but karatsuba starts with 1 digit operations then 2 digits, then 4 digits .... so you need variable digit arithmetics not just 32bit to achieve that. It is possible to encode it from 32bit ALU blocks but that is slowing things down a bit (when I tried to encode it the recursion was slowing thing a lot) and the coding could be complicated if you do not know what you are doing... on the other hand you can limit the operations to modulo prime (which was not mine case) — Spektre, Mar 09 '15 at 15:02
added non-answer with mine karatsuba implementation so you see what I mean ... — Spektre, Mar 09 '15 at 15:18

score 0 · Answer 1 · edited May 23 '17 at 12:29

This is not an answer but this would be unreadable inside comment

it is just illustration how Karatsuba code looks like

This is mine implementation of Karatsuba multiplication for arbitrary floating point numbers

//---------------------------------------------------------------------------
void arbnum::_mul_karatsuba(DWORD *z,DWORD *x,DWORD *y,int n)
    {
    // recursion for karatsuba
    // z[2n]=x[n]*y[n]; n=2^m
    int i;    for (i=0;i<n;i++) if (x[i]) { i=-1; break; }      // x==0 ?
    if (i< 0) for (i=0;i<n;i++) if (y[i]) { i=-1; break; }      // y==0 ?
    if (i>=0){for (i=0;i<n+n;i++) z[i]=0; return; }             // 0.? = 0
    if (n==1) { alu.mul(z[0],z[1],x[0],y[0]); return; }
    if (n< 1) return;
    int n2=n>>1;
    _mul_karatsuba(z+n,x+n2,y+n2,n2);                           // z0 = x0.y0
    _mul_karatsuba(z  ,x   ,y   ,n2);                           // z2 = x1.y1
    DWORD *q=new DWORD[n<<1],*q0,*q1,*qq; BYTE cx,cy;
    if (q==NULL) { _error(_arbnum_error_NotEnoughMemory); return; }
    #define _add { alu.add(qq[i],q0[i],q1[i]); for (i--;i>=0;i--) alu.adc(qq[i],q0[i],q1[i]); } // qq=q0+q1 ...[i..0]
    #define _sub { alu.sub(qq[i],q0[i],q1[i]); for (i--;i>=0;i--) alu.sbc(qq[i],q0[i],q1[i]); } // qq=q0-q1 ...[i..0]
    qq=q;    q0=x+n2; q1=x;   i=n2-1; _add; cx=alu.cy;          // =x0+x1
    qq=q+n2; q0=y+n2; q1=y;   i=n2-1; _add; cy=alu.cy;          // =y0+y1
    _mul_karatsuba(q+n,q+n2,q,n2);                              // =(x0+x1)(y0+y1) mod ((2^N)-1)
    if (cx) { qq=q+n; q0=qq; q1=q+n2; i=n2-1; _add; cx=alu.cy; }// +=cx*(y0+y1)<<n2
    if (cy) { qq=q+n; q0=qq; q1=q   ; i=n2-1; _add; cy=alu.cy; }// +=cy*(x0+x1)<<n2
    qq=q+n;  q0=qq;   q1=z+n; i=n-1;  _sub;                     // -=z0
    qq=q+n;  q0=qq;   q1=z;   i=n-1;  _sub;                     // -=z2
    qq=z+n2; q0=qq;   q1=q+n; i=n-1;  _add;                     // z1=(x0+x1)(y0+y1)-z0-z2
    for (i=n2-1;i>=0;i--) if (alu.cy) alu.inc(z[i]); else break;                // adc(1)
    alu.cy=cx|cy; for (i=n2-1;i>=0;i--) if (alu.cy) alu.inc(z[i]); else break;  // adc(1) if +=cx*(y0+y1)<<n2 or +=cy*(x0+x1)<<n2 overflowed
    delete[] q;
    #undef _add
    #undef _sub
    }
//---------------------------------------------------------------------------
void arbnum::mul_karatsuba(const arbnum &x,const arbnum &y)
    {
    // O(3*(N)^log2(3)) ~ O(3*(N^1.585))
    // Karatsuba multiplication
    int s=x.sig*y.sig;
    arbnum a,b; a=x; b=y; a.sig=+1; b.sig=+1;
    int i,n;
    for (n=1;(n<a.siz)||(n<b.siz);n<<=1);
    a._realloc(n);
    b._realloc(n);
    _alloc(n+n); for (i=0;i<siz;i++) dat[i]=0;
    _mul_karatsuba(dat,a.dat,b.dat,n);
    bits=siz<<5;
    sig=s;
    exp=a.exp+b.exp+((siz-a.siz-b.siz)<<5)+1;
//  _normalize();
    }
//---------------------------------------------------------------------------

no modulo is present
as you can see you will need +,- operations with variable bit width both are encoded as #define so no need for another function calls ...
it uses 32bit ALU in x86 C++ as a block to build variable bit width arithmetics
result is in this object
_mul_karatsuba is just internal recursive function
mul_karatsuba is the main api ...

Number representation:

// dat is MSDW first ... LSDW last
DWORD *dat; int siz,exp,sig,bits;

dat[siz] is the mantisa LSDW means least significant DWORD
exp is exponent of MSB of dat[0]

first nonzero bit is present in mantisa !!!

// |-----|---------------------------|---------------|------|
// | sig | MSB      mantisa      LSB |   exponent    | bits |
// |-----|---------------------------|---------------|------|
// | +1  | 0.(0      ...          0) | 2^0           |   0  | +zero
// | -1  | 0.(0      ...          0) | 2^0           |   0  | -zero
// |-----|---------------------------|---------------|------|
// | +1  | 1.(dat[0] ... dat[siz-1]) | 2^exp         |   n  | +number
// | -1  | 1.(dat[0] ... dat[siz-1]) | 2^exp         |   n  | -number
// |-----|---------------------------|---------------|------|
// | +1  | 1.0                       | 2^+0x7FFFFFFE |   1  | +infinity
// | -1  | 1.0                       | 2^+0x7FFFFFFE |   1  | -infinity
// |-----|---------------------------|---------------|------|

Please explicate `over/underflow LSB of each 32bit chunk of [mantissa]`. Does this happen in the LSDW, too? — greybeard, Oct 19 '15 at 11:08
@greybeard I added example of error to Answer ... also I was wrong the error can propagate (on 20*32 bit operand I saw it also in `1th` bit once). I do not see the error in LSDW now but I think it was there on some test I made in the past but not sure now ... — Spektre, Oct 19 '15 at 11:38
Come to think of it: `(x0+x1)(y0+y1) mod 2^N`. What about the case where both `cx` and `cy` are non-zero? Shouldn't that necessitate two increments or equivalent? To be updated by carries from the _subs in due turn. Either handling these in `z`, or extending `*p` by one. — greybeard, Nov 10 '15 at 14:30
@greybeard Wow thanks man you got it .... no matter how long I spend on this I newer spot that. Have repaired the code (based on your comment in just few minutes :) ) by adding another increment if the stage in question overflowed (does not mean which part) looks like it works for all scenarios (none, one and both operations overflowed) so now the result matches all the other multiplication implementations. Thanks again. It could be further optimized by merging the last increments together ... Also repaired the `MSDW/LSDW` order was reversed in text ... — Spektre, Nov 10 '15 at 15:51
@greybeard I also updated code here [Fast bignum square computation](http://stackoverflow.com/q/18465326/2521214) — Spektre, Nov 10 '15 at 16:20

Karatsuba multiplication algorithm with modular reduction

1 Answers1