Translate a hashing algorithm from C to Python

Question

My client is a Python programmer and I have created a C++ backend for him which includes license generation and checking. For additional safety, the Python front-end will also perform a validity check of the license.

The license generation and checking algorithm however is based on hashing methods which rely on the fact that an integer is of a fixed byte size and bit-shifting a value will not extend the integers byte count.

This is a simplified example code:

unsigned int HashString(const char* str) {
    unsigned int hash = 3151;
    while (*str != 0) {
        hash = (hash << 3) + (*str << 2) * 3;
        str++;
    }
    return hash;
}

How can this be translated to Python? The direct translation obviously yields a different result:

def hash_string(str):
    hash = 3151
    for c in str:
        hash = (hash << 3) + (ord(c) << 2) * 3
    return hash

For instance:

hash_string("foo bar spam")  #  228667414299004
HashString("foo bar spam")   // 3355459964

Edit: The same would also be necessary for PHP since the online shop should be able to generate valid licenses, too.

why not just make a .dll(or .so) of the c code where he could call the method from python using cdll calls — Joran Beasley, Sep 18 '13 at 21:17
@JoranBeasley: Asking the C++ code to validate the C++ code seems like it might defeat the purpose. (I'm not sure what exactly he's trying to protect against here, so it might not, either…) — abarnert, Sep 18 '13 at 21:18
The DLL could be replaced which would make hacking the application a breeze. — Niklas R, Sep 18 '13 at 21:25
One caution: `char` is ambiguous: signed or unsigned? [See here](http://stackoverflow.com/questions/2054939/char-is-signed-or-unsigned-by-default). Should change the C code to be explicit; i.e., `const unsigned char* str`. — Tim Peters, Sep 18 '13 at 21:26
@NiklasR: This is pretty trivial protection. Anyone who can hack your binary DLL and drop in his new version can presumably also hack your Python source code. (This might still be useful in some cases, as long as you know how little you're getting out of it.) — abarnert, Sep 18 '13 at 21:27
@TimPeters: Good point. Or just cast `*str` where you access it, instead of propagating things as possible-signed up until you add to `(hash << 3)`. — abarnert, Sep 18 '13 at 21:28
The hash code also looks "made up" - probably not good. For example, unless the string is empty, the last two bits of the hash are always 0. — Tim Peters, Sep 18 '13 at 21:30
so wait you are worried about someone dropping in a replacement dll (knowing what methods to implement and arguements and return types) but you have no qualms about exposing your hash function inside of python ? — Joran Beasley, Sep 18 '13 at 21:32
@TimPeters: Another good point. The standard trick for copy protection is to use something like HMAC-SHA1 with the key/salt obfuscated within the code. (Really, unless you can hide the key so well that it's harder to find than the algorithm is to crack, it doesn't have to be a good algorithm… but doing the standard trick makes it easier to prove that anyone who cracked it must have violated the DMCA or your contract terms.) — abarnert, Sep 18 '13 at 21:35

score 4 · Answer 1 · answered Sep 18 '13 at 21:15

Mask the hash value with &:

def hash_string(str, _width=2**32-1):
    hash = 3151
    for c in str:
        hash = ((hash << 3) + (ord(c) << 2) * 3)
    return hash & _width

This manually cuts the hash back to size. You only need to limit the result once; it's not as if those higher bits make a difference for the final result.

Demo:

>>> hash_string("foo bar spam")
3355459964

abarnert · Accepted Answer · 2013-09-18T21:31:20.870

The issue here is that C's unsigned int automatically rolls over when it goes past UINT_MAX, while Python's int just keeps getting bigger.

The easiest fix is just to correct at the end:

return hash % (1 << 32)

For very large strings, it maybe a little faster to mask after each operation, to avoid ending up with humongous int values that are slow to work with. But for smaller strings, that will probably be slower, because the cost of calling % 12 times instead of 1 will easily outweigh the cost of dealing with a 48-bit int.

PHP may have the same problem, or a different one.

PHP's default integer type is a C long. On a 64-bit Unix platform, this is bigger than an unsigned int, so you will have to use the same trick as on Python (either % or &, whichever makes more sense to you.)

But on a 32-bit Unix platform, or on Windows, this is the same size as unsigned int but signed, which means you need a different trick. You can't actually represent, say, 4294967293 directly (try it, and you'll get -3 instead). You can use a GMP or BCMath integer instead of the default type (in which case it's basically the same as in Python), or you can just write custom code for printing, comparing, etc. that will treat that -3 as if it were 4294967293.

Note that I'm just assuming that int is 32 bits, and long is either 32 or 64, because that happens to be true on every popular platform today. But the C standard only requires that int be at least 16 bits long, and long be at least 32 bits and no shorter than int. If you need to deal with very old platforms where int might be 16 bits (or 18!), or future platforms where it might be 64 or more, you have to adjust your code appropriately.

Translate a hashing algorithm from C to Python

2 Answers2