3

I have a problem where I generate randomly a dictionary, with a possibly high number of possibilities (say, I have 25'000 possibly different dics). I want to generate an identifier, an ID, for every one of these possibilities. What I want is:

  • If two dictionaries have exactly the same values for each key, then the ID is the same
  • If two dictionaries have a different ID, then they must have at least one difference in their content.
  • The ID stays the same everytime I run the program ( id(x) does not work )
  • Bonus: the ID stays the same for different versions of Python (2.6, 2.7, 3.4, 3.6)

My current idea is to use hash functions (although I understand little about it) and do something like this (suppose a dictionary of int/float numbers):

import hashlib
def getID(mydic):
    ID = 0
    for x in mydic.keys():
        # Hash the content
        ID = ID + int(hashlib.sha256(str(mydic[x]).encode('utf-8')).hexdigest(), 16)
        # Hash the key
        ID = ID + int(hashlib.sha256(x.encode('utf-8')).hexdigest(), 16)
    return (ID % 10**10)

To my understanding, this should work in most cases, but depending on the actual content of the dictionary and the keys, it's not impossible that two different dics yield the same ID. For example, if I do not hash the keys and two different entries can be "1.0", then I can have a problem.

Do you have anything to suggest, which hopefully does not rely on luck?

Edit: I add a bigger code on what I'm trying to do: it's basically a random parameter optimisation. Code on pastebin

martineau
  • 119,623
  • 25
  • 170
  • 301
Milleuros
  • 87
  • 2
  • 9
  • What exactly are you trying to achieve? – tyteen4a03 May 20 '17 at 20:02
  • 1
    Any hash that maps a large input set to a smaller one will have collisions. So there is always a bit of "luck" involved. What you could do is: compare the dicts by your id. If the ids are different the dicts are different. If the ids are the same compare the dicts by value. – Wombatz May 20 '17 at 20:04
  • @tyteen4a03 I have an algorithm that I want to test on a many sets of parameters. I pick a set of parameters randomly, and then run my algorithm - but I want to be able to save that set of parameters to a file without it being overwritten by another set, so that I can always know which parameters led to which results. Do you want me to post a larger code, e.g. on pastebin? – Milleuros May 20 '17 at 20:08
  • If you're not limited to numbers then I would just do `sha1(repr(sorted(my_dict.items())))` (inspired by [this comment](https://stackoverflow.com/questions/5884066/hashing-a-python-dictionary)). Otherwise see @Wombatz 's comment. – tyteen4a03 May 20 '17 at 20:14
  • Question, why would the order of a dictionary matter? – Milleuros May 20 '17 at 20:16
  • because dictionaries are (for all practical purposes) unordered. Dictionaries containing the same keys and values could display differently depending on the history of key addition/deletion. – MSeifert May 20 '17 at 20:16
  • @MSeifert I still do not get it: since the hash I proposed above loops over the keys and values and hashes the values and key strings themselves, it should be ok right? – Milleuros May 20 '17 at 20:21
  • no because `{'a': 1, 'b': 2}` would generate a different hash compared to `{'b': 2, 'a': 1}` (same dictionary, only different order). – MSeifert May 20 '17 at 20:22
  • But regarding the question: Do all your dictionaries contain the same keys? – MSeifert May 20 '17 at 20:24
  • @MSeifert Yes, they do contain the same keys. For the dictionary order, I just gave a try to my function and the two dics you proposed, it yields me the same number. This is because the function in my OP is a sum (commutative) of all hash keys and all hash values. – Milleuros May 20 '17 at 20:30
  • Don't let that fool you, the order isn't garantueed across python versions (with string not even across interpreter restarts) and that was a requirement. Just because it works most of the times means it works always! – MSeifert May 20 '17 at 20:33
  • @MSeifert I still do not understand - hashes are deterministic, so `hash('a') + hash('b')` should be equal to `hash('b') + hash('a')` right? Anyways, what if I adapt the above function by sorting the keys? [Pastebin link](https://pastebin.com/RJeXSCGF) – Milleuros May 20 '17 at 20:42
  • Hashing strings is not-deterministic because they randomized string-hashes (they are only constant per interpreter session, the next session could have different hashes). With sorting you at least have deterministic behaviour. – MSeifert May 20 '17 at 20:44
  • @MSeifert Sorry if this feels like I'm arguing, I'm trying to understand. I wrote a small script that prints the hash of a sample string (some random keyboard mashing), then I ran this script 6 times in Python 2.7 and 25 times in Python 3.4, everytime I got the same value. (Everytime running a different instance of Python). I wouldn't understand why the hash of a string is randomised between sessions since this, to my very limited understanding, breaks the point of a hash function (only learned 5 days ago what a hash is) – Milleuros May 20 '17 at 20:52
  • I get two different values for the `hash('abc')`: `2448848014665344704` and `-2203656332252580611`, both the same python 3.5.3 version, just different sessions. – MSeifert May 20 '17 at 20:56
  • @MSeifert I was using `hashlib.sha256()` , maybe that's why it worked? – Milleuros May 20 '17 at 20:57
  • you're mixing the issues: The `hash` is used to detemine the "order" of the dictionary and because these hashes are randomized the output may differ. And then `hashlib.sha256("{'b': 1, 'a': 1}".encode('utf-8')) == hashlib.sha256("{'a': 1, 'b': 1}".encode('utf-8'))` gives `False` – MSeifert May 20 '17 at 21:00
  • @MSeifert because those are different strings and it's simply turning the dictionary into a string. But the sum of `sha256(x)` for x in `mydic.keys() + mydic.values` should work: I am turning the actual content of the dictionary into a hash, not the string representation of the dictionary. Again, sincere apologies for arguing. – Milleuros May 20 '17 at 21:05

2 Answers2

0

To create an ID, you need to create a non-mutable object. Since keys are unordered, you may need to sort them.

For instance:

mydict = {'a': 1, 'c': 9, 'b': 3}

values = tuple(sorted(mydict.items()))
# -> (('a', 1), ('b', 3), ('c', 9))

Then, you can use your own hash algorithm, for instance with sha256:

import hashlib

def hash_item(m, k, v):
    m.update(k.encode('utf-8'))
    m.update(str(k).encode('utf-8'))

m = hashlib.sha256()
for k, v in values:
    hash_item(m, k, v)
print(m.digest())
# -> b'\xa5\xb42\xee\x03\x07\xbe\x7f\xa2:\xa0\x04a\xf5N\xee4\xba\x9dE%\x1bU\x04V}7\xa8\xda3\x9d\xff'
Laurent LAPORTE
  • 21,958
  • 6
  • 58
  • 103
0

rely on luck; every one else does for good reason. Unless your ID is longer than the longest dictionary you can encode, or you choose not to be able to encode some dictionaries, then there will be multiple dictionaries that have the same ID. It's a simple matter of counting. Say you name one dictionary 1, and another two, and so on. Either you eventually run out of numbers, or your ID gets longer. CGenerally we use IDs or hashes when we want some small quantity that will stand in for an object. If you'd be willing to have the name of a dictionary be as large as the dictionary itself, then you're looking for a canonical representation, not an ID or hash.

The advantage of something like sha256 is that we think it is very hard to find two inputs that have the same hash. Even though it is theoretically certain that there are multiple inputs that give the same sha256, we believe that no one has yet found two inputs that give the same sha256. So, you are almost certainly safe enough ignoring the possibility that you'll run across a hash collision.

Sam Hartman
  • 6,210
  • 3
  • 23
  • 40
  • Thank you, I'll probably accept your answer and rely on luck. Quick question, if I were to use "canonical representation" (I did not know that word), what should I start looking for? – Milleuros May 20 '17 at 21:02
  • So, that would be a different question, but take a look at [canonicaljson](https://pypi.python.org/pypi/canonicaljson/) as an example of something that is likely to come quite close – Sam Hartman May 20 '17 at 21:04