3

IIUC, Python's hash of a function (e.g. when the function is used as a dict key) is not stable across runs.

Can something like dill or other libraries be used to get a hash of a function which is stable across runs and different computers? (id is of course not stable).

Uri
  • No, Python does not use `id` to hash functions. `id` provides a unique identifier for an object that is stable for its lifetime, but it is not necessarily related to the hash used for a `dict`. – chepner Sep 19 '19 at 01:04
  • Updated. (how does python hash functions then?) – Uri Sep 19 '19 at 01:36
  • These seem interesting: https://stackoverflow.com/a/38518893/378594 https://stackoverflow.com/a/54015815/378594 – Uri Sep 19 '19 at 01:40
  • Related: https://stackoverflow.com/questions/64344515/python-consistent-hash-replacement – Albert Sep 19 '22 at 07:39

3 Answers

2

I'm the dill author. I've written a package called klepto which is a hierarchical caching/database abstraction useful for local memory hashing and object sharing across parallel/distributed resources. It includes several options for building ids of functions.

See klepto.keymaps and klepto.crypto for hashing choices -- some work across parallel/distributed resources, some don't. One of the choices is serialization with dill or otherwise.

klepto is similar to joblib, but designed specifically to have object permanence and sharing beyond a single python session. There may be something similar to klepto in dask.

Mike McKerns
  • Hey, thanks for the answer. Could you elaborate on whether this allows for more than just signature hashing, i.e. do different static functions with the same signature get different hashes, and if so how does it work? Is it based on static analysis? Practically I'm looking for something that gets a `Callable` and returns `Text`, where `Callable` can be a `functools.partial`, a `toolz.curry`'d function or just a static function. – Uri Sep 19 '19 at 01:30
  • The intent is that you use an archive. The interface for a `klepto.archive` is a dictionary with some extensions. You can generate the key however you like, and store the function as the value. The chosen `keymap` translates the user-selected key to a key that is stored in the archive. So, if you chose `klepto.keymaps.picklemap`, you could in theory pass the callable as the key and value... and it would store the serialized object as the key, and (potentially) the serialized object as the value. How the value is saved depends on the type of archive you pick (e.g. HDF, picklefile, SQL, etc). – Mike McKerns Sep 19 '19 at 04:26
0

As you mentioned, id will almost never be the same across different processes, and certainly not across different machines. As per the docs:

id(object): Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.

This means that id will differ between runs, because the objects created by each instance of your script reside at different places in memory and are not the same object. id defines identity; it is not a checksum of a block of code.

The only thing that will be consistent over different instances of your script being executed is the name of the function.
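A minimal sketch of that idea, assuming the module name and qualified name are enough to tell your functions apart. `name_key` and `greet` are hypothetical names, not from any library. Note that the module name is `__main__` when the file is run as a script, and that objects such as `functools.partial` have no `__qualname__`, so this only covers plain functions:

```python
import hashlib


def name_key(func):
    """Hex SHA-256 digest of the function's module and qualified name."""
    # Only the name goes into the key: two different bodies under the
    # same name produce the same key.
    qualified = f"{func.__module__}:{func.__qualname__}"
    return hashlib.sha256(qualified.encode("utf-8")).hexdigest()


def greet():
    return "hello"


print(name_key(greet))
```

The digest stays the same across runs as long as the function keeps its name and the module is imported under the same name.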

Another deterministic way to identify a block of code inside your script is to calculate a checksum of its actual source text. But tracking the contents of your methods is better handled by a version control system like git. If you find yourself needing a hash sum of your code or a piece of it, you are likely doing something suboptimally.
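A checksum of the source text could look roughly like this. It is a sketch, assuming the source file is available on disk; `source_hash` is a hypothetical helper name. `inspect.getsource` raises `OSError` or `TypeError` for built-ins and for functions typed into an interactive session:

```python
import hashlib
import inspect
import textwrap


def source_hash(func):
    """Hex SHA-256 digest of the function's source text."""
    # dedent so that a method and an identical top-level function
    # are not told apart by indentation alone.
    source = textwrap.dedent(inspect.getsource(func))
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


# Example on a standard-library function, whose source file is on disk.
print(source_hash(textwrap.dedent))
```

The digest changes on any edit to the text, including comments and whitespace, and it does not capture closures, globals, or helper functions that the body refers to.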

Artur
0

I stumbled upon "hash() is not stable across runs" today. I am now using

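import hashlib
from operator import xor
from struct import unpack

def stable_hash(a_string):
    # SHA-256 does not depend on PYTHONHASHSEED, unlike the built-in hash().
    sha256 = hashlib.sha256()
    sha256.update(bytes(a_string, "UTF-8"))
    digest = sha256.digest()
    h = 0
    # Fold the 32-byte digest into one signed 64-bit integer, 8 bytes at a time.
    for index in range(0, len(digest) >> 3):
        index8 = index << 3
        bytes8 = digest[index8 : index8 + 8]
        # '<q' pins little-endian byte order so the result matches across machines.
        i = unpack('<q', bytes8)[0]
        h = xor(h, i)
    return h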

It's for string arguments. To use it e.g. for a dict you would pass str(tuple(sorted(a_dict.items()))) or something like that as argument. The "sorted" is important in this case to get a "canonical" representation.
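For the dict case, a self-contained sketch might look like the following. `short_stable_hash` and `dict_key` are hypothetical names, and the helper takes the first 8 bytes of the digest instead of XOR-folding all of it, which is a simplification of the function above, not the same function:

```python
import hashlib


def short_stable_hash(a_string):
    """Signed 64-bit integer taken from the SHA-256 digest of a string."""
    digest = hashlib.sha256(a_string.encode("utf-8")).digest()
    # Fixed byte order keeps the result identical across machines.
    return int.from_bytes(digest[:8], "little", signed=True)


def dict_key(a_dict):
    # Sorting the items removes the dependence on insertion order.
    return short_stable_hash(str(tuple(sorted(a_dict.items()))))


a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}
print(dict_key(a) == dict_key(b))  # True: same items, different insertion order
```

This assumes the keys are mutually comparable (so `sorted` works) and that `str()` of every value is itself deterministic. A value such as a set of strings would break that, because its printed order can vary between runs.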

amotzek
  • the question is about hashing a function, not strings. – Uri Aug 13 '22 at 23:30
  • you can get the string of the definition of a function f with inspect.getsource(f) – amotzek Aug 15 '22 at 17:46
  • Yeah but that's not very safe, because it depends on the file and the names in context. Maybe if you add the filename and line it is better, but then the content is redundant. – Uri Aug 20 '22 at 12:27