So, I saw "Hashing a dictionary?", and I was trying to figure out a way to handle native Python objects better and produce stable results.

After looking at all the answers and comments, this is what I came up with. Everything seems to work properly, but am I missing something that would make my hashing inconsistent (besides hash algorithm collisions)?

md5(repr(nested_dict).encode()).hexdigest()

tl;dr: it creates a string with the repr and then hashes the string.

Generated my testing nested dict with this:

nested_dict = {}
for i in range(100):
    for j in range(100):
        if i not in nested_dict:
            nested_dict[i] = {}
        nested_dict[i][j] = ''

I'd imagine repr should be able to support any Python object, since most of them have to support __repr__ anyway, but I'm still pretty new to Python programming. One thing I've heard is that if you use from reprlib import repr instead of the built-in repr, it will truncate large sequences. So that's one potential pitfall, but the built-in repr doesn't seem to do that for the native list and set types.
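A quick way to check the truncation concern is to compare the two functions directly. The snippet below is only a sketch: `Plain` is a hypothetical class made up for illustration. It also shows a second thing to watch for, namely that objects without their own `__repr__` fall back to a default string that embeds the object's id, so it is not stable between objects or runs.

```python
import reprlib

big_list = list(range(100))

# The built-in repr() writes out every element.
full = repr(big_list)

# reprlib.repr() abbreviates long containers, so different lists can
# end up with the same string (and therefore the same digest).
short = reprlib.repr(big_list)
short_other = reprlib.repr(list(range(200)))

print("..." in full)         # False
print("..." in short)        # True
print(short == short_other)  # True


class Plain:
    """Hypothetical class that does not define __repr__."""


# The default repr includes the object's id, which varies.
print(repr(Plain()))
```

So the built-in repr is safe from truncation, but any object in the structure that relies on the default `__repr__` would likely make the digest unstable.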

other notes:

elrey
  • 13
  • 2
  • 3
    I didn't understand what you are trying to do. The title doesn't help me, despite a question mark it isn't really a question. – mkrieger1 Apr 21 '22 at 15:07
  • This approach is not correct – juanpa.arrivillaga Apr 21 '22 at 15:18
  • "I was trying to figure out a way to handle python native objects better and produce stable results." What exactly are you trying to accomplish? – juanpa.arrivillaga Apr 21 '22 at 15:20
  • at the end of the day I'm trying to implement __hash__ on a python object and one of the attributes that needs to be hashed is a dictionary that has IPv4Address objects as keys. I was tacking onto that other question, because I figured I could just actually hash the dict and then pass that to hash() for my __hash__ xor of values. – elrey Apr 21 '22 at 15:46
  • sorry __hash__ should be `__hash__` – elrey Apr 21 '22 at 15:59

1 Answer

Python dicts are insertion-ordered (guaranteed since Python 3.7), and repr respects that order. Your hexdigest of {"A":1,"B":2} will differ from that of {"B":2,"A":1}, whereas in terms of == those dicts are the same.

Yours won't work out:

from hashlib import md5

def yourHash(d):
    return md5(repr(d).encode()).hexdigest()

a = {"A": 1, "B": 2}
b = {"B": 2, "A": 1}

print(repr(a))
print(repr(b))
print(a == b)

print(yourHash(a) == yourHash(b))

gives

{'A': 1, 'B': 2}   # repr a
{'B': 2, 'A': 1}   # repr b
True               # a == b
False              # yourHash(a) == yourHash(b)

I really do not see the "sense" in hashing dicts at all ... and those ones here are not even "nested".


You could use JSON instead: json.dumps() with sort_keys=True sorts the keys at every nesting level, and you can hash the resulting string for the whole structure. I still don't see the sense in it, though, and it will give you plenty of computational overhead:

import json
a = {"A":1,"B":2, "C":{2:1,3:2}}
b = {"B":2,"A":1, "C":{3:2,2:1}}

for di in (a,b):
    print(json.dumps(di,sort_keys=True))

gives

{"A": 1, "B": 2, "C": {"2": 1, "3": 2}}  # thoroughly sorted
{"A": 1, "B": 2, "C": {"2": 1, "3": 2}}  # recursively...

which is exactly what this answer to "Hashing a dictionary?" proposes. Why stray from it?
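For keys that json.dumps() cannot serialise, such as the IPv4Address keys mentioned in the question's comments, one option is to convert the keys to strings first. The following is only a sketch under that assumption; `canonical` and `stable_digest` are hypothetical helper names, not part of any library.

```python
import json
from hashlib import md5
from ipaddress import IPv4Address


def canonical(obj):
    """Recursively convert dict keys and IPv4Address values to strings."""
    if isinstance(obj, dict):
        return {str(key): canonical(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [canonical(item) for item in obj]
    if isinstance(obj, IPv4Address):
        return str(obj)
    return obj


def stable_digest(obj):
    """Hash the key-sorted JSON text of the converted structure."""
    text = json.dumps(canonical(obj), sort_keys=True)
    return md5(text.encode()).hexdigest()


a = {IPv4Address("10.0.0.1"): {"up": True}, IPv4Address("10.0.0.2"): {"up": False}}
b = {IPv4Address("10.0.0.2"): {"up": False}, IPv4Address("10.0.0.1"): {"up": True}}

print(a == b)                                # True
print(stable_digest(a) == stable_digest(b))  # True
```

Note that converting keys with str() means the integer key 2 and the string key "2" become indistinguishable, just as in the json.dumps() output above, so this only suits data where that cannot cause a clash.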

Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
  • The biggest reason is because in my "production" use case I need to have IPv4 Addresses as the keys for the dictionaries. json.dumps doesn't support native IPv4Address objects, and errors out with that. Sorry if maybe I didn't clarify that. So, would sorting before I pass to repr be a suitable fix for that? – elrey Apr 21 '22 at 15:44
  • 1
    @elrey but IPv4 addresses are easily converted to and from strings, and the strings can go into the dict as keys. JSONifying them should then work out; when you retrieve them from JSON again, you need some custom code to turn them back into IPv4 addresses. – Patrick Artner Apr 21 '22 at 15:46
  • I had done that before for something else, but I figured it would be relatively computationally complex (since I'll over time potentially be converting 10s of thousands of IPs), compared to just keeping them in their native object form. So, I was just trying to figure out if there was a better way while in their native object form. – elrey Apr 21 '22 at 15:49
  • it may be surprising, but you can often take an IP address as an `int` (it is, after all, simply a 32-bit or 128-bit number!) .. however, this will confuse readers if this file is meant to be read https://docs.python.org/3/library/ipaddress.html#conversion-to-strings-and-integers – ti7 Apr 21 '22 at 15:52
  • I read about how you could possibly extend the __format__ method on a Python object and create a custom format code (e.g. `j` for JSON). This IMO would be the simplest way to get to the desired state while not forcing the object to lose its internal representation or having to redefine the whole object. I could then pass a custom function to `sorted()` which would leverage that custom format option. – elrey Apr 29 '22 at 02:30
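If the end goal is only a `__hash__` for an object holding such a dict, as described in the comments on the question, a string digest may not be needed at all. Below is a minimal sketch, assuming the dict is flat and its values are hashable; `HostTable` and its attributes are hypothetical names for illustration. IPv4Address objects are themselves hashable, so they can stay in their native form.

```python
from ipaddress import IPv4Address


class HostTable:
    """Hypothetical class whose identity depends on a dict keyed by IPv4Address."""

    def __init__(self, name, hosts):
        self.name = name
        self.hosts = dict(hosts)  # assumed shape: {IPv4Address: hashable value}

    def __eq__(self, other):
        if not isinstance(other, HostTable):
            return NotImplemented
        return (self.name, self.hosts) == (other.name, other.hosts)

    def __hash__(self):
        # A frozenset of items ignores insertion order, matching dict equality.
        return hash((self.name, frozenset(self.hosts.items())))


t1 = HostTable("lab", {IPv4Address("10.0.0.1"): "a", IPv4Address("10.0.0.2"): "b"})
t2 = HostTable("lab", {IPv4Address("10.0.0.2"): "b", IPv4Address("10.0.0.1"): "a"})

print(t1 == t2)              # True
print(hash(t1) == hash(t2))  # True

# An address also converts to a plain integer, as the linked docs describe.
print(int(IPv4Address("10.0.0.1")))  # 167772161
```

The value returned by hash() is only meant for use within one running process (string hashes are randomised per run by default), so this fits `__hash__` but not a digest that is stored or compared across runs. The object should also not be mutated while it sits in a set or is used as a dict key.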