I have a dictionary master which contains around 50,000 to 100,000 unique lists. These can be simple lists or lists of lists. Every list is assigned to a specific ID (which is the key of the dictionary):
master = {12: [1, 2, 4], 21: [[1, 2, 3], [5, 6, 7, 9]], ...} # len(master) is in the tens of thousands
Now I have a few hundred dictionaries, each of which again contains around 10,000 lists (same as above: they can be nested). Example of one of those dicts:
a = {'key1': [6, 9, 3, 1], 'key2': [[1, 2, 3], [5, 6, 7, 9]], 'key3': [7], ...}
I want to cross-reference every one of these dictionaries against my master, i.e. instead of saving every list within a, I want to store only the ID from master whenever the list is present in master.
=> a = {'key1': [6, 9, 3, 1], 'key2': 21, 'key3': [7], ...}
I can do that by looping over all values in a and all values of master and trying to match the lists (by sorting them), but that'll take ages.
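For reference, the brute-force version could look roughly like the sketch below. It assumes that element order does not matter (hence the sorting) and that a list is either flat or a list of lists, not a mix of both. The names normalize and cross_reference_naive are made up for illustration.

```python
def normalize(value):
    """Sort a flat list, or sort each inner list and then the outer list."""
    if value and isinstance(value[0], list):
        return sorted(sorted(inner) for inner in value)
    return sorted(value)


def cross_reference_naive(a, master):
    """Return a copy of `a` with each matching list replaced by its master ID."""
    result = {}
    for key, value in a.items():
        result[key] = value
        # Inner loop over the whole master: this is what makes it slow.
        for master_id, master_value in master.items():
            if normalize(value) == normalize(master_value):
                result[key] = master_id
                break
    return result


master = {12: [1, 2, 4], 21: [[1, 2, 3], [5, 6, 7, 9]]}
a = {'key1': [6, 9, 3, 1], 'key2': [[1, 2, 3], [5, 6, 7, 9]], 'key3': [7]}
print(cross_reference_naive(a, master))
# {'key1': [6, 9, 3, 1], 'key2': 21, 'key3': [7]}
```

With len(a) * len(master) comparisons per dictionary, plus the repeated sorting, this does not scale to a few hundred dictionaries.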
Now I'm wondering: how would you solve this?
I thought of "hashing" every list in master to a unique string and storing it as a key of a new master_inverse reference dict, e.g.:
master_inverse = {hash([1,2,4]): 12, hash([[1, 2, 3], [5, 6, 7, 9]]): 21}
Then it would be very simple to look it up later on:
for k, v in a.items():
    h = hash(v)
    if h in master_inverse:
        a[k] = master_inverse[h]
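One caveat with the pseudocode above: the built-in hash() raises a TypeError for lists, because lists are mutable. A possible way around this, sketched below under the assumption that the lists contain only hashable items (numbers, strings) or further lists, is to convert each list recursively into nested tuples and use those directly as dictionary keys. The helper name to_hashable is made up for illustration.

```python
def to_hashable(value):
    """Recursively convert (nested) lists into (nested) tuples."""
    if isinstance(value, list):
        return tuple(to_hashable(item) for item in value)
    return value


master = {12: [1, 2, 4], 21: [[1, 2, 3], [5, 6, 7, 9]]}
a = {'key1': [6, 9, 3, 1], 'key2': [[1, 2, 3], [5, 6, 7, 9]], 'key3': [7]}

# Reverse lookup: tuple form of each master list -> its ID.
master_inverse = {to_hashable(v): k for k, v in master.items()}

# Only existing keys are reassigned, so mutating `a` while iterating is safe.
for k, v in a.items():
    master_id = master_inverse.get(to_hashable(v))
    if master_id is not None:
        a[k] = master_id

print(a)  # {'key1': [6, 9, 3, 1], 'key2': 21, 'key3': [7]}
```

Because the dict compares keys for equality after the hash matches, a hash collision cannot produce a false match here. Note that this lookup is order-sensitive: [1, 2, 4] and [4, 2, 1] count as different lists unless they are sorted first.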
Do you have a better idea? What could such a hash look like? Is there already a built-in method which is fast and unique?
EDIT: Dunno why I didn't come up with this approach right away: what do you think of using an MD5 hash of either the pickle or the repr() of each list?
Something like this:
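import hashlib

def myHash(value):
    # md5() needs bytes, so encode the repr() string first
    return hashlib.md5(repr(value).encode("utf-8")).hexdigest()

# Reverse lookup: digest of each master list -> its ID
master_inverse = {myHash(v): k for k, v in master.items()}

for k, v in a.items():
    h = myHash(v)
    if h in master_inverse:
        a[k] = master_inverse[h]  # replace the list with the master ID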
EDIT2: I benchmarked it: checking one of the hundred dicts (a in my example; for the benchmark, a contains around 20k values) against my master_inverse is very fast, which I didn't expect: 0.08 sec. So I guess I can live with that well enough.
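A rough way to reproduce such a measurement is sketched below. It uses synthetic data, not my real data: the sizes (50,000 master entries, 20,000 values in a), the value ranges, and the helper random_list are assumptions chosen only for illustration, and the timing will vary with the machine and the list lengths.

```python
import hashlib
import random
import time


def myHash(value):
    # md5() needs bytes, so encode the repr() string first
    return hashlib.md5(repr(value).encode("utf-8")).hexdigest()


def random_list():
    # Hypothetical stand-in for a real list: 1 to 8 small integers.
    return [random.randint(0, 1000) for _ in range(random.randint(1, 8))]


random.seed(0)
master = {i: random_list() for i in range(50000)}
master_inverse = {myHash(v): k for k, v in master.items()}

# Build a test dict with 20k values; every second one is copied from master.
master_values = list(master.values())
a = {}
for i in range(20000):
    if i % 2 == 0:
        a['key%d' % i] = list(random.choice(master_values))
    else:
        a['key%d' % i] = random_list()

start = time.perf_counter()
for k, v in a.items():
    h = myHash(v)
    if h in master_inverse:
        a[k] = master_inverse[h]
elapsed = time.perf_counter() - start

print("checked %d values in %.3f s" % (len(a), elapsed))
```

Building master_inverse is a one-time cost and is deliberately kept outside the timed section, since it can be reused for all of the dictionaries.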