
Python has a builtin hash() function for getting a hash of an object. The result seems to be only 64 bits, so collisions are quite possible. Is there a way to use a more secure hash function on objects?

Python has a builtin hashlib library, but it doesn't work on objects the way the hash() function does. Is there a way to encode a class the way the builtin hash function does?

import hashlib

class hashme:
    a = 33
    b = 22

hasher = hashlib.sha256()
# create an instance of the class
myclass = hashme()
# the builtin hash() function works on the instance
print(hex(hash(myclass)))
# hashlib does not: update() only accepts bytes-like objects, so this raises a TypeError
print(hasher.update(myclass))
philn
  • Are you trying to obfuscate code? If not, what is the purpose here? – roganjosh Jan 15 '19 at 22:22
  • You're mixing two different kinds of "hash". The `hash()` builtin is for use with a hashtable-based data structure (e.g. a `dict` or `set`) -- it should never be used for cryptographic purposes. It exists simply to allow a data structure to quickly check whether objects might be equal; it's only a performance optimization. Cryptographic hash functions like SHA-256 are for cases where you need a strong guarantee that a message has not been altered; that's not something you should ever need to do for objects that only exist in memory. – Daniel Pryden Jan 15 '19 at 22:30
  • You might find the answers to this question on Security.SE useful or relevant: [What is the benefit of having a cryptographically secure hash algorithm in hashmaps?](https://security.stackexchange.com/questions/195134/what-is-the-benefit-of-having-a-cryptographically-secure-hash-algorithm-in-hashm) – Daniel Pryden Jan 15 '19 at 22:35
  • I want a strong guarantee that the object hashes don't collide (and in this instance it's going to be written to disk). I'm not mixing up the hash() checksum with other hashes. Otherwise I wouldn't be asking the question! – philn Jan 15 '19 at 22:38
  • @philn: In that case, you don't want to be hashing the *objects*, you want to be hashing their *serialized representations* that you will be writing to disk. Basically: convert the object to `bytes` first, and then the `hashlib` functions will do what you want. – Daniel Pryden Jan 15 '19 at 22:39
  • @Daniel: That's what I want to do (and what I meant by "encode"). If you can provide a way to do that it'd answer my question. I'd like it to be as general as the hash() function. – philn Jan 15 '19 at 22:44
  • @philn: There is, unfortunately, no *general* way to do this for all objects. (Even the `hash()` function, while ostensibly defined for all objects, won't produce any useful result unless the object has a meaningful `__hash__` implementation.) If you want to do something general-purpose, you could try using `pickle` to serialize the object and hashing the pickled representation. – Daniel Pryden Jan 15 '19 at 22:52
  • Taking a step back: It sounds like you're trying to build some kind of transparent caching layer, and in my experience it's hard to make that kind of thing fully transparent. You'll probably need each object to define a key of some sort, which would be the set of fields that would uniquely identify it. There's no general-purpose algorithm for this, because it depends on what makes two objects "the same" in a given context. I think the most maintainable solution will be to add a method to each class you care about that returns the things that need to be hashed to uniquely identify it. – Daniel Pryden Jan 15 '19 at 22:55
  • Are you trying to create a function that will generate a unique, fixed-length value for any object passed to it? If so, you should know that it's impossible in the general case. – Jim Mischel Jan 16 '19 at 03:25
  • The built-in hash() is unsuitable for use as a persistent cache key. It isn't guaranteed to stay consistent across processes (ref. PYTHONHASHSEED), and by default it is based on object identity, which is usually the memory location (so all instances compare unequal by default, and a reconstituted object will likely have a different hash from the original). You don't really want to "encode a class the way the builtin hash function does?" unless you're creating an in-process, transient cache, which is essentially just a regular dict. – Lie Ryan Jan 16 '19 at 23:19
  • Related: https://stackoverflow.com/questions/64344515/python-consistent-hash-replacement – Albert Sep 19 '22 at 07:41
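
For readers landing here, the suggestion in the comments (give each class a method that returns the fields identifying it, serialize that to `bytes`, then hash the bytes with `hashlib`) could be sketched roughly as below. This is only an illustration under stated assumptions, not an established recipe: `HashMe` is a stand-in for the question's class rewritten with instance attributes, and `identity_key` and `stable_digest` are hypothetical names invented for this example. JSON with sorted keys is used as the serialization because its output is deterministic for simple field types; `pickle`, as mentioned in the comments, is more general, but its byte output is not guaranteed to be identical across protocols or Python versions.

```python
import hashlib
import json


class HashMe:
    """Stand-in for the question's `hashme` class, using instance attributes."""

    def __init__(self, a=33, b=22):
        self.a = a
        self.b = b

    def identity_key(self):
        # Hypothetical method name: returns the fields that define
        # what makes two instances "the same" for caching purposes.
        return {"a": self.a, "b": self.b}


def stable_digest(obj):
    """Return a SHA-256 hex digest of the object's JSON-encoded key."""
    payload = json.dumps(
        {"type": type(obj).__qualname__, "key": obj.identity_key()},
        sort_keys=True,            # fixed key order, so the bytes are deterministic
        separators=(",", ":"),     # fixed whitespace
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first = HashMe()
second = HashMe()
# Two distinct instances with equal fields produce the same digest.
print(stable_digest(first) == stable_digest(second))
# Changing a field changes the digest.
print(stable_digest(first) == stable_digest(HashMe(a=34)))
```

Unlike `hash()`, a digest computed this way depends only on the class name and the chosen fields, so it should stay the same across processes and after the object is written to disk and reloaded. The trade-off is the one raised in the comments: each class has to state explicitly which fields identify it.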

0 Answers