1

The standard hash('hello') function may generate different hashes on different machines, different versions of Python, and even different runs of the same program on the same Python version/machine.

What are decent pure Python (or built in) alternatives that have low chances of collision? The use case I'm after is checking uniqueness of a string in a database (note: it doesn't have to be 100% foolproof, just extremely unlikely to collide with another string).

Blixt
  • 49,547
  • 13
  • 120
  • 153

2 Answers2

1

Murmurhash is a good bet for non-cyrptographic uses (as opposed to hashlib, which uses slow crypto-hashes), for a variety of reasons:

  • very widely used

  • portable across not only Python versions and machines, but across different languages

Here is its Python bindings. Here is a question on the algorithm itself.


If, for some technical reason, a pure-Python roll-your-own implementation might be needed, there are a few feasible options:

  • the source code for murmurhash is in the first link above, but, in pure Python mode, should probably be replaced by something simpler

  • immediate candidates are the Knuth and Jenkins.

Community
  • 1
  • 1
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • Thanks. Is there a package for it that will run in an environment that doesn't allow C modules? – Blixt May 20 '15 at 14:22
  • I'm not sure. What is an "enviornment that doesn't allow C modules": a technical or organizational environment? It depends... – Ami Tavory May 20 '15 at 14:24
  • App Engine runs (I believe) a modified version of pypy with a lot of restrictions due to their sandbox. Basically, I can't run anything compiled, it has to be written entirely in Python. – Blixt May 20 '15 at 14:28
  • See update. Hope it helps. – Ami Tavory May 20 '15 at 14:34
  • Thanks. Do you know how they compare to FNV-1a in terms of random distribution and collision rates? – Blixt May 20 '15 at 14:41
1

Try the python hashlib. It has implementations of MD5, SHA1, SHA224, SHA256, SHA384, and SHA512 which should give you a good hash with low chance of collision.

Example Input:

import hashlib
hashed_string = hashlib.sha512("hello").hexdigest()
print hashed_string

Output:

9b71d224bd62f3785d96d46ad3ea3d73319bfbc2890caadae2dff72519673ca72323c3d99ba5c11d7c7acc6e14b8c5da0c4663475c2e5c3adef46f73bcdec043
Igl3
  • 4,900
  • 5
  • 35
  • 69
  • 1
    Note that these are cryptographic & slow hashes. The OP's question is to the default ``hash`` which is in a different domain. – Ami Tavory May 20 '15 at 14:14
  • I'm not going to be doing thousands of these per second so I think `hashlib` will do, but yes I would have preferred to find a pure Python solution that was not cryptographic. I'll also look into Murmurhash. – Blixt May 20 '15 at 14:21