How to use 64-bit unsigned integer math in Python, respecting C overflow?

Question

I'm trying to implement the djb2 hash in Python.

Here it is in C:

/* djb2 hash http://www.cse.yorku.ca/~oz/hash.html */

uint64_t djb2(size_t len, char const str[len]) {
    uint64_t hash = 5381;
    uint8_t c;
    for(size_t i = 0; i < len; i++) {
        c = str[i];
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    }
    return hash;
}

And here's my attempt in Python:

from ctypes import c_uint64, c_byte, cast, POINTER

def djb2(string: str) -> c_uint64:
    hash = c_uint64(5381)
    raw_bytes = cast(string, POINTER(c_byte * len(string)))[0]
    for i in range(0, len(raw_bytes)):
        hash = c_uint64((((((hash.value << 5) & 0xffffffffffffffff) + hash.value) & 0xffffffffffffffff) + raw_bytes[i]) & 0xffffffffffffffff) # hash * 33 + c
    return hash

However, I'm getting different results between the two, which I suspect is because of different overflow behavior, or otherwise mathematical differences.

The masking in the Python version is an attempt to force the same overflow behavior (based on this answer).

I would replace all `int` objects with the corresponding ctypes int object to start with, e.g. `5` and `0xffffffffffffffff` are Python `int` objects. — juanpa.arrivillaga, Mar 07 '19 at 21:40
Can you give an example string that you get the wrong answer for? — Blckknght, Mar 07 '19 at 21:46

score 3 · Accepted Answer · answered Mar 07 '19 at 22:31

You can implement the algorithm run by the C code very easily in pure Python, without needing any ctypes stuff. Just do it all with regular Python integers, and take a modulus at the end (the high bits won't affect the lower ones for the operations you're doing):

def djb2(string: bytes) -> int:  # note, use a bytestring for this, not a Unicode string!
    h = 5381
    for c in string:    # iterating over the bytestring directly gives integer values
        h = h * 33 + c  # use the computation from the C comments, but consider ^ instead of +
    return h % 2**64    # note you may actually want % 2**32, as this hash is often 32-bit

As I commented in the code, since this is an operation defined on bytestrings, you should use a bytes instance as the argument. Note that there are a bunch of different implementations of this algorithm. Some use ^ (bitwise XOR) instead of + in the step where you update the hash value, and it's often defined to use an unsigned long, which was usually 32 bits, instead of the explicitly 64-bit integer the C version in your question uses.
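To illustrate why the modulus can be deferred: addition and multiplication are compatible with reduction modulo 2**64, so reducing once at the end should give the same value as reducing after every step. Here is a minimal sketch of that check; the names `djb2_mod_at_end`, `djb2_mod_each_step` and `MASK64` are made up for this illustration and are not part of any library.

```python
MASK64 = (1 << 64) - 1  # same value as 0xFFFFFFFFFFFFFFFF

def djb2_mod_at_end(data: bytes) -> int:
    """Let the integer grow freely, reduce once at the end."""
    h = 5381
    for c in data:          # iterating over bytes yields ints in 0..255
        h = h * 33 + c
    return h % 2**64

def djb2_mod_each_step(data: bytes) -> int:
    """Keep only the low 64 bits after every update, like uint64_t in C."""
    h = 5381
    for c in data:
        h = (h * 33 + c) & MASK64
    return h

# Both variants agree, including on inputs long enough to wrap past 64 bits.
for sample in (b"", b"a", b"hello world", bytes(range(256)) * 4):
    assert djb2_mod_at_end(sample) == djb2_mod_each_step(sample)
```

The two functions differ only in how large the intermediate integer is allowed to get, which matters for speed on long inputs but not for the result.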

Thank you! In this case, I'm in control of both the C and Python implementations, so 64-bit is what I want, but that is an excellent point for anyone else who comes across this issue. — winduptoy, Mar 08 '19 at 00:45
This implementation is very inefficient. It takes more than 2.5 minutes to calculate a hash of 1 MB bytestring on my laptop. — Andriy Makukha, Jul 31 '21 at 08:08

score 1 · Answer 2 · answered Jul 31 '21 at 08:06

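When calculating the DJB2 hash in Python, you should keep the running value from growing into a huge arbitrary-precision integer. Python integers never overflow, so without masking the intermediate value gains about five bits per input byte and every multiplication gets slower. To avoid this, do hash &= 0xFFFFFFFFFFFFFFFF after each iteration.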

Here is a proper one-liner implementation of DJB2 in Python:

import functools, itertools
djb2 = lambda x: functools.reduce(lambda x,c: (x*33 + c) & ((1<<64)-1), itertools.chain([5381], x))

Notes:

Because Python is an interpreted language, writing (x << 5) + x instead of x*33 is not more efficient.
((1<<64)-1) is just shorthand for 0xFFFFFFFFFFFFFFFF.
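For readers who find the one-liner dense, the same idea can be sketched as an explicit loop. The names `djb2_loop` and `djb2_reduce` are hypothetical, and encoding the text as UTF-8 is an assumption: the C function hashes whatever raw bytes it is given, so both sides have to agree on the encoding. The sketch also passes 5381 as the initial value to `functools.reduce` rather than chaining it in front of the input.

```python
import functools

MASK64 = (1 << 64) - 1  # keep only the low 64 bits, like uint64_t

def djb2_loop(data: bytes) -> int:
    h = 5381
    for c in data:
        # Shift form from the C code; masking each step keeps h small.
        h = ((h << 5) + h + c) & MASK64
    return h

def djb2_reduce(data: bytes) -> int:
    # Same computation, with 5381 as reduce's initial value.
    return functools.reduce(lambda h, c: (h * 33 + c) & MASK64, data, 5381)

# Assumption: the text is hashed as its UTF-8 bytes.
encoded = "héllo".encode("utf-8")
assert djb2_loop(encoded) == djb2_reduce(encoded)
```

Whether the shift form or `h * 33` is used makes no difference to the result; the mask is what keeps each step at a fixed size.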
