Alternative to python hash function for arbitrary objects

Question

In python2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. A mockup code looks like this:

class PersistentDict(object):
  def __setitem__(self, key, value):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    self._store_to_bucket(bucket_index, key, value)

  def __getitem__(self, key):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    return self._fetch_from_bucket(bucket_index)[key]

In python3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations. So, I need an alternative:

Must be stable across interpreter invocations
May require parameters supplied at execution time, e.g. setting a salt in the call
Must support arbitrary objects (anything supported by dict/set)

I've already tried using hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh) which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash() works with almost everything.

[1] Using hash() to identify buckets is either:

Not reliable across interpreter invocations, if salts are random
Prevents applications from using the random salting feature, if salts are fixed
Unusable if two PersistentDicts were created with different salts

The random hash only applies to `str`, `bytes` and `datetime` objects. You'd only need an alternative for those types. — Martijn Pieters, Jun 24 '16 at 09:17
@MartijnPieters Thanks! I guess str and bytes can be covered by zlib/hashlib. I'll see whether I can find something fast for datetime. — MisterMiyagi, Jun 24 '16 at 09:26
`str(datetime)` should suffice; that gets you the ISO8601 format with microseconds and optionally the timezone if attached, and then you have a `str`... — Martijn Pieters, Jun 24 '16 at 09:46
I'd still test thoroughly and look for alternatives if timezones are involved however; I suspect that timezone subtleties could lead to equal ISO8601 representations for otherwise different timezone objects, in ways that could matter. — Martijn Pieters, Jun 24 '16 at 09:47
@MartijnPieters It seems that salting also applies to ``None``. I have yet to find a suitable solution for hashable containers, without restricting it to ``repr`` or the like. — MisterMiyagi, Dec 07 '17 at 09:50
For `None` the `__hash__` implementation is inherited from `type`, which produces the hash of the memory address of the object, since this varies from interpreter to interpreter process, the value can appear random. This means that setting `PYTHONHASHSEED` **will not affect the hash value of `None`**, like it would for types to which the seed does apply. — Martijn Pieters, Dec 07 '17 at 10:28
So I guess I have to modify my first comment to include type objects (custom classes and built-in) and `None`. — Martijn Pieters, Dec 07 '17 at 10:31
Related: https://stackoverflow.com/questions/64344515/python-consistent-hash-replacement — Albert, Sep 19 '22 at 07:43

MisterMiyagi · Answer 1 · 2017-12-07T11:00:45.917

I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:

def hashkey(obj, salt=0):
  """
  Create a key suitable for use in hashmaps

  :param obj: object for which to create a key
  :type: str, bytes, :py:class:`datetime.datetime`, object
  :param salt: an optional salt to add to the key value
  :type salt: int
  :return: numeric key to `obj`
  :rtype: int
  """
  if obj is None:
    return 0
  if isinstance(obj, str):
    return zlib.adler32(obj.encode(), salt) & 0xffffffff
  elif isinstance(obj, bytes):
    return zlib.adler32(obj, salt) & 0xffffffff
  elif isinstance(obj, datetime_type):
    return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff

With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead. 0.8 usec for bytes and 0.7 for str.

Overhead is roughly as follows:

0.1 usec for the function call (hash(obj) vs def pyhash(obj): return hash(obj))
0.2 usec to 0.5 usec for selecting the hash function via isinstance
0.75 usec for zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~ 0.75 usec (adler and crc are +/- 4 usec)
0.15 usec for obj.encode() of str objects ("foobar")
1.5 usec for str(obj).encode() of datetime.datetime objects

The most optimization comes from ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:

def hashkey_c(obj, salt=0):
  if obj.__class__ in hashkey_c.types:
    if obj is None:
      return 0
    if obj.__class__ is str:
      return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif obj.__class__ is bytes:
      return zlib.adler32(obj, salt) & 0xffffffff
    elif obj.__class__ is datetime_type:
      return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff
hashkey_c.types = {str, bytes, datetime_type, type(None)}

Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for objects, ints, etc. Using a dict to map type to hash comparable, if one uses an explicit check on the dict keys (aka types) separately (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).

Some additional notes:

hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
It does not work properly for immutable containers (which define __hash__) containing a salted type, e.g. (1, 2, 'three')

As of python3.6 or so this is not working. It is not stable across different interpreter executions. E.g. try hashkey(None). — Neal Young, Dec 07 '17 at 06:24
@NealYoung Thanks for checking. It seems ``hash(None)`` is salted at least since Python3.4 - I've added this case now. I also realised that hashable containers, such as Tuple, would require special casing as well. — MisterMiyagi, Dec 07 '17 at 09:48

score 1 · Answer 2 · answered Jul 19 '22 at 13:02

A good alternative is xxhash. It provides fast integer hashes (as Python's hash), but they are consistent across multiple runs and machines:

import xxhash
print(xxhash.xxh32_intdigest('hash me please'))
print(xxhash.xxh64_intdigest('hash me please'))
print(xxhash.xxh128_intdigest('hash me please'))

obj = [0, 1, 2]
print(xxhash.xxh64_intdigest(str(obj)))
print(xxhash.xxh64_intdigest(str(obj), seed=0))
print(xxhash.xxh64_intdigest(str(obj), seed=42))

'''
Prints:
  465606393
  6686454294630346756
  110192986912562192471431245034848549222
  11514819435353980464
  11514819435353980464
  11772420285327955252
'''

Since it hashes strings, for arbitrary objects you may hash str(your_object) or repr(your_object). You can also set an optional seed parameter that defaults to zero, as shown in the example.

Yet another alternative is pyhash. It works fine on Linux, but after some bad experiences installing it on Windows machines, I recommend xxhash which works well on all platforms and is even faster than pyhash for some small examples that I tested.

score 0 · Answer 3 · answered Sep 19 '22 at 09:32

We use this code for our workflow manager Sisyphus, where we use hashes to store the outputs of jobs on disks, based on input parameters, thus we hash the input parameters.

Specifically:

def short_hash(obj,
               length=12,
               chars='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'):
    """
    :param object obj:
    :param int length:
    :param str|T chars:
    :rtype: str|T
    """
    h = hashlib.sha256(sis_hash_helper(obj)).digest()
    h = int.from_bytes(h, byteorder='big', signed=False)
    ls = []
    for i in range(length):
        ls.append(chars[int(h % len(chars))])
        h = h // len(chars)
    return ''.join(ls)


def get_object_state(obj):
    """
    Export current object status
    Comment: Maybe obj.__reduce__() is a better idea? is it stable for hashing?
    """

    if hasattr(obj, '__getnewargs_ex__'):
        args = obj.__getnewargs_ex__()
    elif hasattr(obj, '__getnewargs__'):
        args = obj.__getnewargs__()
    else:
        args = None

    if hasattr(obj, '__sis_state__'):
        state = obj.__sis_state__()
    elif hasattr(obj, '__getstate__'):
        state = obj.__getstate__()
    elif hasattr(obj, '__dict__'):
        state = obj.__dict__
    elif hasattr(obj, '__slots__'):
        state = {k: getattr(obj, k) for k in obj.__slots__ if hasattr(obj, k)}
    else:
        assert args is not None, "Failed to get object state of: %s" % repr(obj)
        state = None

    if args is None:
        return state
    else:
        return args, state


def sis_hash_helper(obj):
    """
    Takes most object and tries to convert the current state into bytes.
    :param object obj:
    :rtype: bytes
    """

    # Store type to ensure it's unique
    byte_list = [type(obj).__qualname__.encode()]

    # Using type and not isinstance to avoid derived types
    if isinstance(obj, bytes):
        byte_list.append(obj)
    elif obj is None:
        pass
    elif type(obj) in (int, float, bool, str, complex):
        byte_list.append(repr(obj).encode())
    elif type(obj) in (list, tuple):
        byte_list += map(sis_hash_helper, obj)
    elif type(obj) in (set, frozenset):
        byte_list += sorted(map(sis_hash_helper, obj))
    elif isinstance(obj, dict):
        # sort items to ensure they are always in the same order
        byte_list += sorted(map(sis_hash_helper, obj.items()))
    elif isfunction(obj):
        # Handle functions
        # Not a nice way to check if the given function is a lambda function, but the best I found
        # assert not isinstance(lambda m: m, LambdaType) is true for all functions
        assert obj.__name__ != '<lambda>', "Hashing of lambda functions is not supported"
        byte_list.append(sis_hash_helper((obj.__module__, obj.__qualname__)))
    elif isclass(obj):
        byte_list.append(sis_hash_helper((obj.__module__, obj.__qualname__)))
    elif hasattr(obj, '_sis_hash'):
        # sis job or path object
        return obj._sis_hash()
    else:
        byte_list.append(sis_hash_helper(get_object_state(obj)))

    byte_str = b'(' + b', '.join(byte_list) + b')'
    if len(byte_str) > 4096:
        # hash long outputs to avoid arbitrary long return values. 4096 is just
        # picked because it looked good and not optimized,
        # it's most likely not that important.
        return hashlib.sha256(byte_str).digest()
    else:
        return byte_str

Alternative to python hash function for arbitrary objects

3 Answers3

Linked