Persistent Hashing of Strings in Python

Question

How would you convert an arbitrary string into a unique integer, which would be the same across Python sessions and platforms? For example hash('my string') wouldn't work because a different value is returned for each Python session and platform.

It would be helpful if you could clarify if you want a guarantee of uniqueness, or if you're satisfied with a high probability of uniqueness, as with a hash function. The fact that you're talking about hash() suggests the latter...? Do you need to be able to invert the mapping, or not? — , Jul 05 '21 at 14:13

score 48 · Answer 1 · edited May 09 '12 at 13:27

48

Use a hash algorithm such as MD5 or SHA1, then convert the hexdigest via int():

>>> import hashlib
>>> int(hashlib.md5('Hello, world!').hexdigest(), 16)
144653930895353261282233826065192032313L

edited May 09 '12 at 13:27

Community

1
1

answered Mar 24 '10 at 20:17

Ignacio Vazquez-Abrams

776,304
153
1,341
1,358

7

This is a good answer, but technically the integer produced is not *unique*. There are less MD5 hashes than available strings. However, chances of a collision are very low – Eli Bendersky Mar 24 '10 at 20:25
7

This is the case for any hash method. – MatthieuW Jul 01 '13 at 09:01
What does "very low" mean? Would it be unwise to use this algorithm in production when uniqueness is required? – kalu Jul 10 '14 at 22:27
If uniqueness is required then don't use a hash; use sequential numbering or a UUID instead. – Ignacio Vazquez-Abrams Jul 10 '14 at 23:35
4

Slight modification: If you want to constrain the size of the int: `int(hashlib.md5('Hello, world!').hexdigest()[:8], 16)` will be < 2**32, `int(hashlib.md5('Hello, world!').hexdigest()[:16], 16)` will be < 2**64. – Peter Sep 19 '14 at 23:22
in case you want an `int64` (like `hash()`) try: `struct.unpack('l', hashlib.md5('Hello'.encode('utf-8')).digest()[:8])[0]`: it uses all of the first 64 bits, including a sign. – Pierre D Oct 13 '20 at 22:25
3

If you get `TypeError: Unicode-objects must be encoded before hashing` you need to encode your string, for example: `int(hashlib.md5('Hello, world!'.encode('utf-8')).hexdigest(), 16)` – Dylan Hogg Aug 10 '22 at 07:06

Jason Sundram · Accepted Answer · 2010-03-24T20:53:28.783

8

If a hash function really won't work for you, you can turn the string into a number.

my_string = 'my string'
def string_to_int(s):
    ord3 = lambda x : '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

In[10]: string_to_int(my_string)
Out[11]: 109121032115116114105110103L

This is invertible, by mapping each triplet through chr.

def int_to_string(n)
    s = str(n)
    return ''.join([chr(int(s[i:i+3])) for i in range(0, len(s), 3)])

In[12]: int_to_string(109121032115116114105110103L)
Out[13]: 'my string'

edited Mar 24 '10 at 20:53

answered Mar 24 '10 at 20:41

Jason Sundram

12,225
19
71
86

5

This maps '\0' and '\0\0' to the same thing - you should prepend a '1'. Also this is a little inefficient, could use the hex representation so you'd have smaller numbers (this is then equivalent to using the string's binary representation and interpreting it as a number). – redtuna Mar 24 '10 at 20:58

score 3 · Answer 3 · answered Oct 23 '12 at 04:52

Here are my python27 implementation for algorithms listed here: http://www.cse.yorku.ca/~oz/hash.html. No idea if they are efficient or not.

from ctypes import c_ulong

def ulong(i): return c_ulong(i).value  # numpy would be better if available

def djb2(L):
  """
  h = 5381
  for c in L:
    h = ((h << 5) + h) + ord(c) # h * 33 + c
  return h
  """
  return reduce(lambda h,c: ord(c) + ((h << 5) + h), L, 5381)

def djb2_l(L):
  return reduce(lambda h,c: ulong(ord(c) + ((h << 5) + h)), L, 5381)

def sdbm(L):
  """
  h = 0
  for c in L:
    h = ord(c) + (h << 6) + (h << 16) - h
  return h
  """
  return reduce(lambda h,c: ord(c) + (h << 6) + (h << 16) - h, L, 0)

def sdbm_l(L):
  return reduce(lambda h,c: ulong(ord(c) + (h << 6) + (h << 16) - h), L, 0)

def loselose(L):
  """
  h = 0
  for c in L:
    h += ord(c);
    return h
  """
  return sum(ord(c) for c in L)

def loselose_l(L):
  return reduce(lambda h,c: ulong(ord(c) + h), L, 0)

redtuna · Answer 4 · 2010-03-24T20:40:29.283

First off, you probably don't really want the integers to be actually unique. If you do then your numbers might be unlimited in size. If that really is what you want then you could use a bignum library and interpret the bits of the string as the representation of a (potentially very large) integer. If your strings can include the \0 character then you should prepend a 1, so you can distinguish e.g. "\0\0" from "\0".

Now, if you prefer bounded-size numbers you'll be using some form of hashing. MD5 will work but it's overkill for the stated purpose. I recommend using sdbm instead, it works very well. In C it looks like this:

static unsigned long sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while (c = *str++)
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

The source, http://www.cse.yorku.ca/~oz/hash.html, also presents a few other hash functions.

You're quite correct. This would definitely be a problem if I was trying to convert entire documents to a number. However, for my application, I'll only be converting short strings, usually less than a few dozen characters. — Cerin, Mar 25 '10 at 13:30

score 0 · Answer 5 · answered Jan 23 '13 at 01:55

Here's another option, quite crude (probably has many collisions) and not very legible.

It worked for the purpose of generating an int (and later on, a random color) for different strings:

aString = "don't panic"
reduce( lambda x,y:x+y, map( lambda x:ord(x[0])*x[1],zip( aString, range( 1, len( aString ) ) ) ) )

Persistent Hashing of Strings in Python

5 Answers5

Linked