Why is my C bigint modexp so slow, or, why is it so fast in python?

Question

As a learning experience I decided to cook up my own bigint routines for taking modexp, in C. Code below, but to summarize: it's repeated squaring with Montgomery multiplication. And it takes forever to run. On my virtual machines which has 1gb ram, anything bigger than around a 128-bit prime modulus causes the program to kill itself, and smaller ones still take an appreciable amount of time. Meanwhile, in python, I did

def modexp ( g, u, p ):
   s = 1
   while u != 0:
      if u & 1:
         s = (s * g)%p
      u >>= 1
      g = (g * g)%p;
   return s
p = 0xffffffffffffffffc90fdaa22168c234c4c6628b80dc1cd129024e088a67cc74020bbea63b139b22514a08798e3404ddef9519b3cd3a431b302b0a6df25f14374fe1356d6d51c245e485b576625e7ec6f44c42e9a637ed6b0bff5cb6f406b7edee386bfb5a899fa5ae9f24117c4b1fe649286651ece45b3dc2007cb8a163bf0598da48361c55d39a69163fa8fd24cf5f83655d23dca3ad961c62f356208552bb9ed529077096966d670c354e4abc9804f1746c08ca237327ffffffffffffffff
g = 0x02
a = int(urandom(1000).encode('hex'),base=16) % p
modexp(g,a,p)

And it's, like, instant. Does python just have a ton of stuff precomputed for dealing with bigints, or is there something very deep going on, or is it the most likely outcome once again, that is to say, I just have no idea what I'm doing?

Here's some of the relevant C code, not that I'm asking anyone to read through it all, but may as well include it...

Bigints are stored in a typedef struct string which is just a wrapper for a char * with len and sign fields.

string __modExpOdd(string a, string e, string n){//Assumes the  modulus is odd. Repeated squaring with Mongtomery multiplication.
    string r = charToS(0x01);
    int rpow=0;
    int ebits  = countSignificantBits(e);
    while(__bigIntComp(r,n)==-1){
        r = stringLeftShift(r,1);
        rpow++;
    }
    string nprime = extendedEuclidean(r,n,charToS(0x01))[1];
    nprime.sign = 1;
    string abar = bigIntDivide(bigIntMultiply(a,r),n)[1];
    string xbar = bigIntDivide(r,n)[1];
    for(int i = ebits-1; i>=0; i--){
        xbar = __monPro(xbar, xbar, n, nprime, rpow);
        if(bigIntParity(stringRightShift(e,i))){
            xbar = __monPro(abar, xbar, n, nprime, rpow);
        }
    }
    return  __monPro(xbar, charToS(0x01), n, nprime, rpow);
}

Debugging tells me that worst bottleneck is the division step in the main loop of the extended euclidean algorithm:

string *extendedEuclidean(string A, string B, string C){
//returns x, y, with 0 <= x < B such that Ax + By = C
    string S,s,T,t,R,r,q,temp;

    S=charToS(0x00);
    s=charToS(0x01);
    T=charToS(0x01);
    t=charToS(0x00);
    R=bigIntCopy(B);
    r=bigIntCopy(A);
    int revflag = 0;
    if(bigIntComp(A,B)==-1){// need to have A>B for this to work.
        temp = R;
        R = r;
        r = temp;
        revflag = 1;
    }

    while(bigIntComp(R,charToS(0x00))!=0){
        q = bigIntDivide(r,R)[0];
        memcpy(&temp, &R, sizeof(string)); R = bigIntSubtract(r, bigIntMultiply(q, R)); memcpy(&r, &temp, sizeof(string));
        memcpy(&temp, &S, sizeof(string)); S = bigIntSubtract(s, bigIntMultiply(q, S)); memcpy(&s, &temp, sizeof(string));
        memcpy(&temp, &T, sizeof(string)); T = bigIntSubtract(t, bigIntMultiply(q, T)); memcpy(&t, &temp, sizeof(string));
    }
    //at this point, s*A + t*B = r = gcd(A,B), and T,S are quotients of A,B by the gcd.
    string *qr = bigIntDivide(C,r);
    string ratio = bigIntCopy(qr[0]);
    if(bigIntComp(qr[1],charToS(0x00))!=0) return NULL;
    if(revflag==1){
        temp = s;
        s = t;
        t = temp;
    }

    //normalise s, t so that 0 <= s < B
    if(s.sign==-1){
        qr = bigIntDivide(bigIntAbs(s),bigIntAbs(B));
        string q = (bigIntComp(qr[1],charToS(0x00))==0) ? qr[0] : bigIntIncr(qr[0]);
        s = bigIntAdd(s,bigIntMultiply(q,bigIntAbs(B)));
        t = bigIntSubtract(t, bigIntMultiply(q,bigIntAbs(A)));
    }
    //multiply to get the correct coefficients
    s = bigIntMultiply(s,ratio);
    t = bigIntMultiply(t,ratio);
    string *ret = calloc(2,sizeof(string));
    ret[0] = s;
    ret[1] = t;
    return ret;
}

And you might be curious to see the division routine, which is just long division:

string *__bigIntDivide(string a, string b){ //assumes a, b positive
    string *qr = calloc(2, sizeof(string));
    if(__bigIntComp(b,charToS(0x00))==0){
        return NULL;
    }
    if(__bigIntComp(a,b)==-1){
        qr[0] = charToS(0x00);
        qr[1] = bigIntCopy(b);
        return qr;
    }
    if(__bigIntComp(a,b)==0){
        qr[0] = charToS(0x01);
        qr[1] = charToS(0x00);
        return qr;
    }
    int pow = ispowerof2(b);
    if(pow!=-1){
        qr[0] = stringRightShift(a,pow);
        qr[1] = getSignificantBits(a,pow);
        qr[0].sign=1;
        qr[1].sign=1;
        return qr;
    }
    //long division, in base-256
    string divisor = charToS(a.c[0]);
    string quotient = NULLSTRING;
    int i=1;
    do{

        while(__bigIntComp(divisor,b)==-1 && i<a.len){ 
            divisor = stringCat(divisor, charToS(a.c[i]));
            i++;
        }
        if(i == a.len && __bigIntComp(divisor, b) == -1){
            break;
        }
        qr = __findQuotient(divisor, b, charToS(0x00), charToS(0xFF));
        quotient = stringCat(quotient, qr[0]);
        divisor = qr[1];
    }while(i<a.len);
    qr[0] = bigIntCopy(quotient);
    qr[1] = bigIntCopy(divisor);
    return qr;
}

string *__findQuotient(string divisor, string dividend, string minq, string maxq){ //an O(log n) division routine, finds q by a binary search instead of linear
    string q = stringRightShift(bigIntAdd(minq,maxq),1);
    string diff = stringRightShift(bigIntSubtract(maxq,minq),1);
    string p = bigIntMultiply(dividend, q);
    string *qr=calloc(2,sizeof(string));

    while( __bigIntComp(diff, charToS(0x00)) == 1){
        fflush(stdout);
        if(__bigIntComp(divisor, p) == 1){ // if divisor > dividend *q, q is too small.
            q = bigIntAdd(q, diff);
        }
        if(__bigIntComp(divisor, p) == -1){ // if divisor < dividend * q, q is too big.
            q = bigIntSubtract(q, diff);
        }
        if(__bigIntComp(divisor, p)==0){
            qr[0] = q;
            qr[1] = charToS(0x00);
            return qr;
        }
        p = bigIntMultiply(dividend, q);
        diff = stringRightShift(diff,1);
    }
    while(__bigIntComp(divisor, p)==1){ // if a > b*q, q is too small, so increment it. Afterwards, it will be 1 too big.
        q = bigIntIncr(q);
        p = bigIntAdd(p, dividend);
    }
    while(__bigIntComp(divisor, p) == -1){ // while a < b*q, decrement q
        q = bigIntDecr(q);
        p = bigIntSubtract(p,dividend);
    }
    qr[0] = q;
    qr[1] = bigIntSubtract(divisor, p);
    return qr;
}

And the Montgomery product function, which is just directly from the paper where it was first described:

string __monPro(string abar, string bbar, string n, string nprime, int rpow){
    string t = bigIntMultiply(abar, bbar);
    string m = getSignificantBits(bigIntMultiply(t,nprime),rpow);
    string u = stringRightShift(bigIntAdd(t,bigIntMultiply(m,n)),rpow);
    if(bigIntComp(u,n)>=0) return bigIntSubtract(u,n);
    return u;
}

So I'm just curious if it's all precomputed in python, or if I should be looking for a memory leak? I haven't tested it with any of the C bigint libraries because I've been lazy, and because I assume python is just using one of those internally anyway. Thanks for taking a look.

See http://stackoverflow.com/questions/5246856/how-did-python-implement-the-built-in-function-pow — Hugh Bothwell, Jan 21 '16 at 00:25
Can you show us your full code? Quite a few functions are missing. Also, it doesn't look like you are ever deallocating memory. Why is that so? — fuz, Jan 21 '16 at 00:29
Handling large integers as strings isn't going to be particularly fast. I would try storing them as an array of `uint32_t` integers instead. Or just use a library; GMP is ideal for this sort of thing. — r3mainer, Jan 21 '16 at 00:29
@FUZxxl, full code is at https://github.com/gmossessian/CStringUtils. Most of the routines here are in CStringUtils_bigInt.c. You're absolutely right that I'm never deallocating memory, it's because it's poorly designed code which I've been writing as I've been learning, and badly in need of being redesigned. — gmoss, Jan 21 '16 at 00:32
@squeamishossifrage why is handling a bigint as a `char *` going to be slower than handling it as a `uint32_t *`? Isn't it just going to be the same block of bytes in memory either way? — gmoss, Jan 21 '16 at 00:33
You can multiply two 32-bit numbers and get a 64-bit result in one operation. Doing this in 8-bit chunks would take a lot longer. — r3mainer, Jan 21 '16 at 00:35
What platform and compiler for your C code? What are the optimizations? — Andrew Henle, Jan 21 '16 at 00:38
As you acknowledge yourself this is badly written code, why do you benchmark it at all? First write the final code, test and debug it, then identify bottlenecks and in the pre-last step optimise. The last step would be debugging/testing. As others already stated, strings are for character sequences, integers are for - integers. Don't use oranges to make an apple pie. — too honest for this site, Jan 21 '16 at 01:53
@gmoss: You can read [Python's source code to see what they do](https://docs.python.org/3/library/functions.html#pow) for [three arg `pow`](https://docs.python.org/3/library/functions.html#pow). It would beat your `modpow` function written in Python by doing more work at the C layer, but it's still pretty simple. Their representation for arbitrary precision isn't actually all that efficient (it wastes bits, and on earlier versions of Python, doesn't take advantage of 64 bit CPU math); your Python code isn't using any of `pow`s optimizations, so odds are your C code has some serious issues. — ShadowRanger, Jan 21 '16 at 01:55
@gmoss That at least explains why your code eats so much memory: Your division algorithm for example allocates about 1000 numbers each time it runs. If you never free any of those, it's clear that your memory is going to be full in a very short time. — fuz, Jan 21 '16 at 09:13
@AndrewHenle I'm using clang on ubuntu. And after some googling, now I know about compiler optimizations... Thanks for the hint, I'll play around with those. — gmoss, Jan 21 '16 at 17:18
@FUZxxl I guess I really do need to rework it! I now understand why so many C library functions require that you pass a pointer to the result instead of just returning the result. — gmoss, Jan 21 '16 at 17:22
@Olaf I acknowledge that this is badly written code because something like 4-5 months ago I was at `printf("hello world");`, and this code builds off of code that I started writing back then. Rewriting it hasn't been a priority because what I'm trying to do is to learn enough crypto and programming to get a "real" job after I finish my totally irrelevant math phd in June. I hadn't really thought about what made it so badly written until @FUZxxl pointed out the terrible memory management, so I guess I really do need to take a couple of days to redo this nonsense. — gmoss, Jan 21 '16 at 17:29
@squeamishossifrage Are you telling me that my processor has instructions to multiply two `uint32_t`s and come out with a `uint64_t`? Is any typecasting required? I.e. in C, can I do `uint64_t myfunc(uint32_t a, uint32_t b){ return a*b; }`, or do I need to typecast them to 64-bit types? What about division? — gmoss, Jan 21 '16 at 17:31
@gmoss Convert them to `uint64_t` first. Then the multiplication yields a 64 bit result. And if you consider rewriting your code: Do not prefix your function names with two underscores. Such names are reserved. — fuz, Jan 21 '16 at 18:04
@FUZxxl Great thanks. Is there a standard best-practice way of naming functions that are just internal library functions and not meant to be seen by the outside? Or just name them whatever and make them static? — gmoss, Jan 21 '16 at 18:13
@gmoss Use mapfiles if you make a shared library and call the functions something internal like `bii_name` for **b**ig**i**nt **i**nternal. — fuz, Jan 21 '16 at 18:23
See the C standard: http://port70.net/~nsz/c/c11/n1570.html#7.1.3 (you might want to read a bit more in the standard; although a bit formally and not didactically structured, it is a good reading for any C programmer). Note that your environment might reserve additional names. IIRC, POSIX reserves `_t` suffix, too. — too honest for this site, Jan 21 '16 at 18:40
@Olad Here is [what POSIX reserves](http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_02_02). — fuz, Jan 21 '16 at 19:03
@FUZxxl Thanks for being so helpful. I have been using `ar` to create library files. What are mapfiles? A google search did not turn anything up. — gmoss, Jan 22 '16 at 02:13
@gmoss Basically, for a *shared library,* a map file specifies which symbols are accessible from outside the library. You can use this to hide internal symbols, making them unavailable to the rest of the program. — fuz, Jan 22 '16 at 07:23
@gmoss Notably, map files don't work with static (`.a`) libraries due to how these work. — fuz, Jan 22 '16 at 07:51

Why is my C bigint modexp so slow, or, why is it so fast in python?

0 Answers0