Short Python alphanumeric hash with minimal collisions

Question

I'd like to set non-integer primary keys for a table using some kind of hash function. md5() seems to be kind of long (32-characters).

What are some alternative hash functions that perhaps use every letter in the alphabet as well as integers that are perhaps shorter in string length and have low collision rates?

Thanks!

score 42 · Answer 1 · edited Nov 06 '18 at 17:25

42

Why don't you just truncate SHA1 or MD5? You'll have more collisions then if you didn't truncate, but it's still better than designing your own. Note that you can base64-encode the truncated hash, rather than using hexadecimal. E.g.

import base64
import hashlib
hasher = hashlib.sha1("The quick brown fox")
base64.urlsafe_b64encode(hasher.digest()[:10])

You can truncate as little (including not at all) or as much as you want, as long as you understand the tradeoffs.

EDIT: Since you mentioned URL-safe, you can use urlsafe_b64encode and urlsafe_b64decode, which uses - and _ rather than + and /.

edited Nov 06 '18 at 17:25

vallentin

23,478
6
59
81

answered Mar 24 '10 at 19:16

Matthew Flaschen

278,309
50
514
539

Thanks. Is there any low collision alphanumeric hash function, less than say 16 characters, that does not involve truncating? Thank you. – ensnare Mar 24 '10 at 19:28
3

Why don't you want to truncate? – Matthew Flaschen Mar 24 '10 at 20:13
3

You may also want to remove all `=` characters added to the end. They don't substantially reduce collision rate, but they add two characters. So maybe something like: `base64.urlsafe_b64encode(hasher.digest()[0:10]).replace('=', '')` – speedplane Aug 16 '17 at 15:28
4

@speedplane Minor thing but `.rstrip('=')` makes the intent a little clearer – bcattle Apr 12 '18 at 21:07
1

I don't think you need to use `b64encode` the `.hexdigest()` method of the hash object returns a string containing only hexadecimal digits. – monkut Apr 24 '20 at 04:47
Use `.encode('utf-8')` if you get complains about encoding the initial string – Abramodj May 14 '20 at 17:28
@monkut the reason to b64encode would be to increase the amount of data "per character", i.e. reduce the number of potential collisions for the same length hash. – Ben Page Nov 30 '22 at 19:01
Maybe I'm missing something, but b64encode is only an encoding of the resulting digest. Just because you _expand_ the truncated result of the sha1 digest by b64 encoding the result doesn't increase the entropy. From what I can see it's reduced. You should be better off truncating the digest result to 16, instead of using a digest of 10 then encoding to 16. – monkut Dec 01 '22 at 01:01

score 38 · Answer 2 · edited Jan 22 '19 at 19:34

The smallest builtin hash I am aware of is md5

>>> import hashlib, base64
>>> d=hashlib.md5(b"hello worlds").digest(); d=base64.b64encode(d); 
>>> print(d)

b'S27ylES0wiLdFAGdUpFgCQ=='

Low collision and short are somewhat mutually exclusive due to the birthday paradox

To make it urlsafe you need to use the function from the base64 module

>>> import base64
>>> base64.urlsafe_b64encode(hashlib.md5("hello world").digest())
'XrY7u-Ae7tCTyyK7j1rNww=='

However there should be no problem storing the 16 byte md5 digest in the database in binary form.

>>> md5bytes=hashlib.md5("hello world").digest()
>>> len(md5bytes)
16
>>> urllib.quote_plus(md5bytes)
'%5E%B6%3B%BB%E0%1E%EE%D0%93%CB%22%BB%8FZ%CD%C3'

Python 2

>>> base64.urlsafe_b64encode(md5bytes)
'XrY7u-Ae7tCTyyK7j1rNww=='

Python 3

>>> base64.urlsafe_b64encode(md5bytes).decode('ascii')
'XrY7u-Ae7tCTyyK7j1rNww=='

You can choose either the quote_plus or the urlsafe_b64encode for your url, then decode with the corresponding function unquote_plus or urlsafe_b64decode before you look them up in the database.

Use `hashlib.hexdigest()` to get a url safe string without conversion — Veilkrand, Apr 26 '19 at 06:26

score 2 · Answer 3 · answered Mar 24 '10 at 20:14

2

Below is a solution that uses alphanumeric characters plus a few punctuation characters. It returns very short strings (around 8 characters).

import binascii, struct

def myhash(s):
    return binascii.b2a_base64(struct.pack('i', hash(s)))

answered Mar 24 '10 at 20:14

Daniel Stutzbach

74,198
17
88
77

1

`hash(s)` gives a different result for 32/64 bit platforms – John La Rooy Mar 24 '10 at 20:44
2

@gnibbler The question doesn't list consistency across platforms as a requirement. – Daniel Stutzbach Mar 26 '10 at 00:53
5

hash() also gives different results each time you run a python instance – wordsforthewise Aug 15 '18 at 05:53

Paulo Scardine · Answer 4 · 2010-09-10T05:42:50.973

0

You can use something like base 32 notation. It is more compact than decimal notation, case insensitive and collision-free. Just encode a plain old sequence number to generate a short hash-like code.

If the key is not for human consumption, you can use base 64 notation, which is case sensitive but a little more compact.

See http://code.google.com/p/py-cupom/ for an example.

edited Sep 10 '10 at 05:42

answered Sep 10 '10 at 05:35

Paulo Scardine

73,447
11
124
153

In python you can use `base64.b32encode` for that. – Managarm Aug 08 '22 at 12:58

score 0 · Answer 5 · answered Oct 26 '22 at 09:51

0

I used xor of md5 bytes to get shorter hash

>>> d = hashlib.md5(b"hello worlds").digest()  # 16 bytes

# xor of bytes to get 3 hash bytes
>>> h = bytes([
     d[0] ^ d[1] ^ d[2] ^ d[3] ^ d[14] ^ d[15], 
     d[4] ^ d[5] ^ d[6] ^ d[7] ^ d[13], 
     d[8] ^ d[9] ^ d[10] ^ d[11] ^ d[12]],
     )  

>>> base64.urlsafe_b64encode(h)
b'8xC5'

# 4 digit str
>>> base64.urlsafe_b64encode(h).decode('utf-8')  
'8xC5'

answered Oct 26 '22 at 09:51

Ryabchenko Alexander

10,057
7
56
88

much better just to truncate. This will increase the number of collisions a lot because of xor is order independant. i.e. `X ^ Y ^ Z == Y ^ Z ^ X` – Ben Page Nov 30 '22 at 19:04

Short Python alphanumeric hash with minimal collisions

5 Answers5

Linked