47

I have a need for a high-performance string hashing function in Python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). There are several other questions like this one on Stack Overflow, but every accepted/upvoted answer I could find fell into one of a few categories, none of which apply (for the given reasons):

  • Use the built-in hash() function. This function, at least on the machine I'm developing for (with Python 2.7 and a 64-bit CPU), produces an integer that fits within 32 bits - not large enough for my purposes.
  • Use hashlib. hashlib provides cryptographic hash routines, which are far slower than they need to be for non-cryptographic purposes. I find this self-evident, but if you require benchmarks and citations to convince you of this fact then I can provide that.
  • Use the string.__hash__() function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around 32 bits - again, too small for my use! Very frustrating, it's so close to perfect!

An ideal solution would have the following properties, in a relative, loose order of importance.

  1. Have an output range of at least 34 bits, likely 64 bits, while preserving consistent avalanche properties over all bits. (Concatenating 32-bit hashes tends to violate the avalanche properties, at least with my dumb examples.)
  2. Portable. Given the same input string on two different machines, I should get the same result both times. These values will be stored in a file for later re-use.
  3. High-performance. The faster the better as this function will get called roughly 20 billion times during the execution of the program I'm running (it is the performance-critical code at the moment.) It doesn't need to be written in C, it really just needs to outperform md5 (somewhere in the realm of the built-in hash() for strings).
  4. Accept a 'perturbation' (what's the better word to use here?) integer as input to modify the output. I put an example below (the list formatting rules wouldn't let me place it nearer.) I suppose this isn't 100% necessary since it can be simulated by perturbing the output of the function manually, but having it as input gives me a nice warm feeling.
  5. Written entirely in Python. If it absolutely, positively needs to be written in C then I guess that can be done, but I'd take a 20% slower function written in python over the faster one in C, just due to project coordination headache of using two different languages. Yes, this is a cop-out, but this is a wish list here.

'Perturbed' hash example, where the hash value is changed drastically by a small integer value n

def perturb_hash(key,n):
    return hash((key,n))

Finally, if you're curious as to what the heck I'm doing that I need such a specific hash function, I'm doing a complete re-write of the pybloom module to enhance its performance considerably. I succeeded at that (it now runs about 4x faster and uses about 50% of the space), but I noticed that sometimes, if the filter got large enough, its false-positive rate suddenly spiked. I realized it was because the hash function wasn't addressing enough bits. 32 bits can only address 4 billion bits (mind you, the filter addresses bits and not bytes), and some of the filters I'm using for genomic data double that or more (hence the 34-bit minimum).
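To make the addressing point concrete, here is a rough sketch of how the filter turns one wide hash into bit positions; `wide_hash` is a placeholder for whatever >=34-bit hash function ends up being chosen, not an existing function:

def bit_indices(key, num_bits, num_hashes, wide_hash):
    # Each probe below must be able to reach any position in a filter that
    # can be larger than 2**32 bits, which is why a 32-bit hash falls short.
    # `wide_hash(key, seed)` is a placeholder for whatever >=34-bit function
    # ends up being used.
    return [wide_hash(key, seed) % num_bits for seed in range(num_hashes)]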

Thanks!

eblume
  • Is there anything wrong with `hash(s) * 2**32 + hash(s+s)`? If `hash` is "good enough" then that's "good enough", isn't it? Assuming that `hash(s+s)` bears no discernable relation to `hash(s)`, then you get your avalanche across all output bits. And if it's not fast enough due to the memory allocation, you can code in C to in effect apply the hash algorithm to `s+s`, but without actually performing the string concatenation. – Steve Jessop Mar 23 '11 at 09:55
  • So in other words, hash(s)<<32 + hash(s+s). I'll give it a shot - thanks for the idea! – eblume Mar 24 '11 at 02:15
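A minimal sketch of the combination discussed in the comments above (note that it inherits the built-in hash()'s caveats, so it only addresses the width requirement, not portability):

def hash64_concat(s):
    # Pack the two 32-bit-ish values from the comment above into one 64-bit
    # result.  Masking each half to 32 bits keeps the shift well-defined on
    # both 32- and 64-bit builds.  The built-in hash() is not portable across
    # machines, so this helps with width only.
    hi = hash(s) & 0xffffffff
    lo = hash(s + s) & 0xffffffff
    return (hi << 32) | lo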

6 Answers

26

Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. It should be possible to port this to Python, either pure or as a C extension. (Update: the author recommends using the 128-bit variant and throwing away the bits you don't need.)
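For example, if a binding exposes the 128-bit result as a Python integer, getting 64 bits is just a mask; `murmur3_x64_128` below is a placeholder name for whatever binding is used, not a real function:

def murmur128_to_64(key, seed=0):
    # `murmur3_x64_128` is a placeholder for any MurmurHash3 x64/128 binding
    # that returns the digest as a Python integer; keep the low 64 bits and
    # discard the rest, as the algorithm's author suggests.
    h128 = murmur3_x64_128(key, seed)
    return h128 & 0xffffffffffffffff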

If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.

Update: I put together a quick Python wrapper for the Murmur3 hash function. The GitHub project is here, and you can find it on the Python Package Index as well; it just needs a C++ compiler to build, no Boost required.

Usage example and timing comparison:

import murmur3
import timeit

# without seed
print murmur3.murmur3_x86_64('samplebias')
# with seed value
print murmur3.murmur3_x86_64('samplebias', 123)

# timing comparison with str __hash__
t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
print 'murmur3:', t.timeit()

t = timeit.Timer("str.__hash__('hello')")
print 'str.__hash__:', t.timeit()

Output:

15662901497824584782
7997834649920664675
murmur3: 0.264422178268
str.__hash__: 0.219163894653
samplebias
  • You know I had seen this module but couldn't get it to compile due to missing the C++ python_boost library (or was it boost_python?). I'll take another look at it. I'll let you know how it goes - thanks! – eblume Mar 23 '11 at 05:00
  • Yep, it requires Boost Python. On Ubuntu this can be installed with `sudo apt-get install libboost-python-dev`. I built a [package in my PPA](https://launchpad.net/~spaceboy/+archive/extras/+sourcepub/1562800/+listing-archive-extra) as an example. – samplebias Mar 23 '11 at 06:10
  • Unfortunately Ubuntu's package management system is still back with python 2.6 so I had to install 2.7 on the side. I could be incredibly dense but it looks like Boost Python has a wickedly difficult manual install. Any tips? – eblume Mar 23 '11 at 07:28
  • Yep, I also did a performance test on the Boost-wrapper murmur2 and found it lacking, so I created my own wrapper around murmur3. Check the update above. This should get you going. :-) – samplebias Mar 23 '11 at 14:51
  • This is quite fantastic, thanks very much! I have a question though - platform.cpp has some mentions of processor affinity in it. The code that will be executing the hashing function is already highly parallelized - I hope that won't cause problems with this package? – eblume Mar 23 '11 at 21:52
  • The SetAffinity function doesn't get called in the murmur3 code, and platform.cpp is there just for completeness. – samplebias Mar 23 '11 at 23:17
  • I won't get to test this until tomorrow but I'm pretty sure it's the right way to go. Thanks again! – eblume Mar 24 '11 at 02:17
  • In my personal use case, in which I am always specifying a 'seed' value (aliased as the 'perturb' value I mentioned in my original question), murmur3 implemented in C++ outperforms python's hash((key,n)) by more than 10%. This is definitely the way to go. Thanks very much! – eblume Mar 25 '11 at 20:06
  • @samplebias It would be great if you can update this. I would like to try it, but it won't compile on my mac and I don't feel like diving much deeper. – ldmtwo Jun 17 '20 at 04:44
10

BE CAREFUL WITH THE BUILT-IN HASH FUNCTION!

Since Python 3.3, hash randomization is enabled by default: the string hash is fed a different seed every time the interpreter starts, so it generates different values on every run -- but not for native numeric types.

$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-1756730906053498061 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4556027264747844925 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4403217265550417031 322818021289917443
Simone Aonzo
  • Correct. You would have to correctly set PYTHONHASHSEED in the env before the hash function is initialized. – ldmtwo Jun 17 '20 at 04:12
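To illustrate that point, pinning the seed in the environment makes the built-in hash reproducible from run to run; both invocations below print the same number (its value depends on the interpreter version, so it is not shown here):

$ PYTHONHASHSEED=12345 python3 -c 'print(hash("Hello!"))'
$ PYTHONHASHSEED=12345 python3 -c 'print(hash("Hello!"))'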
4

Have a look at xxHash; there's also a pip package.

xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical across all platforms (little / big endian).

I've been using xxHash for a long time (my typical use case is to hash strings -- not for security purposes) and I'm really satisfied with its performance.
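A minimal usage sketch, assuming the `xxhash` package from pip and its `xxh64` object interface (check the package docs if yours differs); the seed plays the role of the 'perturbation' value from the question:

import xxhash

# 64-bit digest of a string, with an optional seed as the 'perturbation'.
h = xxhash.xxh64('some genomic read', seed=123)
print(h.intdigest())   # digest as an unsigned 64-bit integer
print(h.hexdigest())   # same digest, hex-encoded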

Simone Aonzo
3

Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.

That's not quite the whole story. The built-in hash is stored in a C `long`, so it is a 64-bit value on typical 64-bit systems (64-bit Linux or macOS), though it remains 32 bits on 64-bit Windows, where `long` is only 32 bits.

This is the python str hashing function from Objects/stringobject.c (Python version 2.7):

static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;      /* Notice the 64-bit hash, at least on a 64-bit system */

    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}
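For comparison, here is a pure-Python sketch of the same recurrence forced to 64 bits; the mask and the extra seed parameter are additions (for portability and the 'perturbation' input from the question), not part of CPython's code:

def string_hash_64(s, seed=0):
    # Same multiply/XOR recurrence as string_hash above, but masked to 64
    # bits so the result is identical on every platform; `seed` is an extra
    # input, not part of CPython's algorithm.
    if not s:
        return 0
    mask = 0xffffffffffffffff
    x = ((ord(s[0]) << 7) ^ seed) & mask
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & mask
    x ^= len(s)
    return x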
Saish
  • Another problem with the built-in `hash()` function is hash randomization in Python 3.3. If the bloom filter needs to be able to be written to disk, then the built-in `hash()` function can't be used. – amcnabb Feb 01 '13 at 22:03
  • Also, implementations of python on cloud platforms like Heroku and GAE will return different values for `hash()` on different instances making it useless for anything that must be shared between two or more "machines" (dynos in the case of heroku) – B Robster Jan 08 '15 at 01:37
2

"strings": I'm presuming you wish to hash Python 2.x str objects and/or Python3.x bytes and/or bytearray objects.

This may violate your first constraint, but: consider using something like

(zlib.adler32(strg, perturber) << N) ^ hash(strg)

to get a (32+N)-bit hash.
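A sketch of that combination with the masking needed to keep the widths well-defined (the function name and the default N are made up for illustration):

import zlib

def adler_plus_hash(strg, perturber, n=32):
    # adler32, seeded with the perturber, supplies the top 32 bits; the
    # built-in hash(), masked to its low 32 bits, supplies the bottom ones,
    # giving a (32+n)-bit result.
    top = zlib.adler32(strg, perturber) & 0xffffffff
    return (top << n) ^ (hash(strg) & 0xffffffff)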

John Machin
  • You are correct in your assumption that I'm hashing `str` objects - I'll look in to this snippet, thanks, but you're right, I personally doubt that there is consistent entropy to each output bit here. Thanks though! – eblume Mar 23 '11 at 04:58
0

If you can use Python 3.2, the hash result on 64-bit Windows is now a 64-bit value.

casevh
  • I've been using Python 2.7, but if the hash width in the 3.x engine is definitely, consistently wider then that might be enough to get me to switch. Thanks! – eblume Mar 23 '11 at 04:56
  • @eblume: The 64-bit hash on 64-bit Windows is an enhancement in 3.2. 64-bit Linux platforms have always had a 64-bit hash value. 32-bit versions of Python (both Linux and Windows) only have a 32-bit hash value. – casevh Mar 23 '11 at 13:45