BigInteger numbers implementation and performance

Question

I have written a BigInteger class in C++ that should be able to do operations on all numbers with any size. Currently I am trying to achieve a very fast multiplication method by comparing the existing algorithms and test for which amount of digits they work best and I ran into very unexpected results.I tried to do 20 multiplications of 500-digit and I timed them. This was the result:

karatsuba:
  14.178 seconds

long multiplication:
  0.879 seconds

Wikipedia told me

It follows that, for sufficiently large n, Karatsuba's algorithm will perform fewer shifts and single-digit additions than longhand multiplication, even though its basic step uses more additions and shifts than the straightforward formula. For small values of n, however, the extra shift and add operations may make it run slower than the longhand method. The point of positive return depends on the computer platform and context. As a rule of thumb, Karatsuba is usually faster when the multiplicands are longer than 320–640 bits.

Since my numbers are at least 1500 bits long this is quite unexpected because wikipedia said karatsuba should run faster. I believe that my problem might be in my addition algorithm but I don't see how I could make it faster because it's already running in O(n). Ill post my code below so that you can check my implementations. I'll leave the irrelevant parts out of it.
I was also thinking that maybe the structure I used was not the best possible. I represented each data segment in little endian. So for example if I have the number 123456789101112 stored into data segments of length 3 it would look like this:

{112,101,789,456,123}

so this is why I am asking now what the best structure and best way to implement a BigInteger class is? And why is the karatsuba algorithm slower than the long multiplication one ?

This is my code: (I'm sorry for the length)

using namespace std;

bool __longmult=true;
bool __karatsuba=false;

struct BigInt {
public:
    vector <int> digits;

    BigInt(const char * number) {
        //constructor is not relevant   
    }
    BigInt() {}

    void BigInt::operator = (BigInt a) {
        digits=a.digits;
    }

    friend BigInt operator + (BigInt,BigInt);
    friend BigInt operator * (BigInt,BigInt);

    friend ostream& operator << (ostream&,BigInt);
};

BigInt operator + (BigInt a,BigInt b) {
    if (b.digits.size()>a.digits.size()) {
        a.digits.swap(b.digits); //make sure a has more or equal amount of digits than b
    }
    int carry=0;

    for (unsigned int i=0;i<a.digits.size();i++) {
        int sum;
        if (i<b.digits.size()) {
            sum=b.digits[i]+a.digits[i]+carry;
        } else if (carry==1) {
            sum=a.digits[i]+carry;
        } else {
            break; // if carry is 0 and no more digits in b are left then we are done already
        }

        if (sum>=1000000000) {
            a.digits[i]=sum-1000000000;
            carry=1;
        } else {
            a.digits[i]=sum;
            carry=0;
        }
    }

    if (carry) {
        a.digits.push_back(1);
    }

    return a;
}

BigInt operator * (BigInt a,BigInt b) {
    if (__longmult) {
        BigInt res;
        for (unsigned int i=0;i<b.digits.size();i++) {
            BigInt temp;
            temp.digits.insert(temp.digits.end(),i,0); //shift to left for i 'digits'

            int carry=0;
            for (unsigned int j=0;j<a.digits.size();j++) {
                long long prod=b.digits[i];
                prod*=a.digits[j];
                prod+=carry;
                int t=prod%1000000000;
                temp.digits.push_back(t);
                carry=(prod-t)/1000000000;
            }
            if (carry>0) {
                temp.digits.push_back(carry);
            }
            res+=temp;
        }
        return res;
    } else if (__karatsuba) {
        BigInt res;
        BigInt a1,a0,b1,b0;
        assert(a.digits.size()>0 && b.digits.size()>0);
        while (a.digits.size()!=b.digits.size()) { //add zeroes for equal size
            if (a.digits.size()>b.digits.size()) {
                b.digits.push_back(0);
            } else {
                a.digits.push_back(0);
            }
        }

        if (a.digits.size()==1) {
            long long prod=a.digits[0];
            prod*=b.digits[0];

            res=prod;//conversion from long long to BigInt runs in constant time
            return res;

        } else {
            for (unsigned int i=0;i<a.digits.size();i++) {
                if (i<(a.digits.size()+(a.digits.size()&1))/2) { //split the number in 2 equal parts
                    a0.digits.push_back(a.digits[i]);
                    b0.digits.push_back(b.digits[i]);
                } else {
                    a1.digits.push_back(a.digits[i]);
                    b1.digits.push_back(b.digits[i]);
                }
            }
        }

        BigInt z2=a1*b1;
        BigInt z0=a0*b0;
        BigInt z1 = (a1 + a0)*(b1 + b0) - z2 - z0;

        if (z2==0 && z1==0) {
            res=z0;
        } else if (z2==0) {
            z1.digits.insert(z1.digits.begin(),a0.digits.size(),0);
            res=z1+z0;
        } else {
            z1.digits.insert(z1.digits.begin(),a0.digits.size(),0);
            z2.digits.insert(z2.digits.begin(),2*a0.digits.size(),0);
            res=z2+z1+z0;
        }

        return res;
    }
}

int main() {
    clock_t start, end;

    BigInt a("984561231354629875468546546534125215534125215634987498548489456125215421563498749854848945612385663498749854848945612521542156349874985484894561238561698774565123165221393856169877456512316552156349874985484894561238561698774565123165221392213935215634987498548489456123856169877456512316522139521563498749854848945612385616987745651231652213949651465123151354686324848945612385616987745651231652213949651465123151354684132319321005482265341252156349874985484894561252154215634987498548489456123856264596162131");
    BigInt b("453412521563498749853412521563498749854848945612521542156349874985484894561238565484894561252154215634987498548489456123856848945612385616935462987546854521563498749854848945653412521563498749854848945612521542156349874985484894561238561238754579785616987745651231652213965465341235215634987495215634987498548489456123856169877456512316522139854848774565123165223546298754685465465341235215634987498548354629875468546546534123521563498749854844139496514651231513546298754685465465341235215634987498548435468");

    __longmult=false;
    __karatsuba=true;

    start=clock();
    for (int i=0;i<20;i++) {
        a*b;
    }
    end=clock();
    printf("\nTook %f seconds\n", (double)(end-start)/CLOCKS_PER_SEC);

    __longmult=true;
    __karatsuba=false;

    start=clock();
    for (int i=0;i<20;i++) {
        a*b;
    }
    end=clock();
    printf("\nTook %f seconds\n", (double)(end-start)/CLOCKS_PER_SEC);

    return 0;
}

It appears that you are considering numbers < 10^9 as a digit. This IMO will make karatsuba algorithm closer to naive multiplication. Try doing this considering numbers < 10 as a digit. Moreover, are you sure that the maximum value of carry can be at most 1? — Abhishek Bansal, May 31 '14 at 07:08
That is exactly what I had done earlier. Result: single 500-digit multiplication in 16 seconds... I believe it is convenient to use <10^9 as digits because those values can be multiplied in just a few clocks in hardware and It is always better to have something in hardware than in software. Yes, worst case is 999 999 999 + 999 999 999 and this gives 1 999 999 998 which gives the carry value of 1. — gelatine1, May 31 '14 at 07:12
Can you show your entire code so that someone can try it out. — Abhishek Bansal, May 31 '14 at 07:35
Also, you may be having an int overflow depending on your system. — Abhishek Bansal, May 31 '14 at 07:39
You should use a profiler to find out where the time is being spent — Oliver Charlesworth, May 31 '14 at 08:07
Entire code availaible here: (654 lines) http://codepad.org/dzvPbqIV — gelatine1, May 31 '14 at 08:12

score 3 · Answer 1 · edited May 23 '17 at 11:55

You use std::vector

for your digits make sure there are no unnecessary reallocations in it. So allocate space before operation to avoid it. Also I do not use it so I do not know the array range checking slowdowns.

Check if you do not shift it !!! which is O(N) ... i.e. insert to first position...

Optimize your implementation

here you can find mine implementation optimized an unoptimized for comparison

x=0.98765588997654321000000009876... | 98*32 bits...
mul1[ 363.472 ms ]... O(N^2) classic multiplication
mul2[ 349.384 ms ]... O(3*(N^log2(3))) optimized karatsuba multiplication
mul3[ 9345.127 ms]... O(3*(N^log2(3))) unoptimized karatsuba multiplication

mine implementation threshold for Karatsuba is around 3100bits ... ~ 944 digits!!! The more optimized the code the lover threshold is.

Try to remove unnecessary data from function operands

//BigInt operator + (BigInt a,BigInt b)
BigInt operator + (const BigInt &a,const BigInt &b)

this is way you will not create another copy of a,b on heap in every + call also even faster is this:

mul(BigInt &ab,const BigInt &a,const BigInt &b) // ab = a*b

Schönhage-Strassen multiplication

this one is FFT or NTT based. Mine threshold for it is big ... ~ 49700bits ... ~ 15000digits so if you do not plan to use such big numbers then forget about it. Implementation is also in the link above.

here is mine NTT implementation (optimized as much as I could)
Summary

Does not matter if you use little or big endian but you should code your operations in a way that they do not use insert operations.

You use decadic base for digits that is slow because you need to use division and modulo operations. If you choose base as power of 2 then just bit operations are enough and also it removes many if statements from code which are slowing all the most. If you need the base as power of 10 then use biggest you can in some cases this reduce the div,mod to few subtractions
```
2^32 = 4 294 967 296 ... int = +/- 2147483648
base = 1 000 000 000

//x%=base
while (x>=base) x-=base;
```
max number of cycles is 2^32/base or 2^31/base on some platform is this faster then modulo and also the bigger the base the less operations you need but beware the overflows !!!

So instead of `vector digits;` you would suggest to use `vector digits` and use it as a binary representation ? — gelatine1, May 31 '14 at 08:42
@gelatine1 NO you still can use vector but the base will be not 1000000000 but 0x100000000, the only thing you need to add is conversion between dec and hex strings like here: http://stackoverflow.com/a/18231860/2521214 I prefer hex representation because it is much much faster (no divisions just bit operations) for print and even assign operations. Of course for big numbers this conversion is slower then operation on them itself so be careful what times are you measuring — Spektre, May 31 '14 at 08:49

BigInteger numbers implementation and performance

1 Answers1

Linked