57

I want to compute an md5 hash not of a string, but of an entire data structure. I understand the mechanics of a way to do this (dispatch on the type of the value, canonicalize dictionary key order and other randomness, recurse into sub-values, etc). But it seems like the kind of operation that would be generally useful, so I'm surprised I need to roll this myself.

Is there some simpler way in Python to achieve this?

UPDATE: pickle has been suggested, and it's a good idea, but pickling doesn't canonicalize dictionary key order:

>>> import cPickle as pickle
>>> import hashlib, random 
>>> for i in range(10):
...  k = [i*i for i in range(1000)]
...  random.shuffle(k)
...  d = dict.fromkeys(k, 1)
...  p = pickle.dumps(d)
...  print hashlib.md5(p).hexdigest()
...
51b5855799f6d574c722ef9e50c2622b
43d6b52b885f4ecb4b4be7ecdcfbb04e
e7be0e6d923fe1b30c6fbd5dcd3c20b9
aebb2298be19908e523e86a3f3712207
7db3fe10dcdb70652f845b02b6557061
43945441efe82483ba65fda471d79254
8e4196468769333d170b6bb179b4aee0
951446fa44dba9a1a26e7df9083dcadf
06b09465917d3881707a4909f67451ae
386e3f08a3c1156edd1bd0f3862df481
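The same holds on Python 3, where dicts preserve insertion order: the shuffled insertion order leaks straight into the pickle, so the digests still differ. A minimal reproduction:

```python
import hashlib
import pickle
import random

digests = set()
for _ in range(10):
    keys = [i * i for i in range(1000)]
    random.shuffle(keys)
    d = dict.fromkeys(keys, 1)
    digests.add(hashlib.md5(pickle.dumps(d)).hexdigest())

print(len(digests))  # almost surely 10: each shuffle pickles differently
```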
Ned Batchelder

8 Answers

103

json.dumps() can sort dictionaries by key, so you don't need any extra dependencies:

import hashlib
import json

data = ['only', 'lists', [1,2,3], 'dictionaries', {'a':0,'b':1}, 'numbers', 47, 'strings']
data_md5 = hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()

print(data_md5)

Prints:

87e83d90fc0d03f2c05631e2cd68ea02
webwurst
  • Excellent, and hashlib and json are always present with python – Stéphane Feb 06 '13 at 14:30
  • Nice solution, but bear in mind that some data types cannot be converted to JSON without **extra work** - `datetime` being the most common. `data = ['1234', 234, datetime.datetime(2013,1,1)]` `hashlib.md5(json.dumps(a, sort_keys=True)).hexdigest()` results in `TypeError: datetime.datetime(2013, 1, 1, 0, 0) is not JSON serializable` – Boris Chervenkov Oct 19 '13 at 13:54
  • @Boris: It's fairly easy to get the `json` module to serialize many more data types (including instances of most user-defined classes as well as `datetime.datetime` instances) as shown in [my answer](http://stackoverflow.com/a/18561055/355230) to the question [_Making object JSON serializable with regular encoder_](http://stackoverflow.com/questions/18478287/making-object-json-serializable-with-regular-encoder). – martineau Nov 10 '15 at 00:44
  • In python3 there will be a `TypeError: Unicode-objects must be encoded before hashing` So use this `data_md5 = hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()` – Dineshs91 Mar 08 '17 at 06:51
  • Very clever solution for a deterministic hashing function in pure Python – Chris Conlan Jul 12 '20 at 20:30
  • Is the dict order issue still a thing? Python 3.7 and later [preserves insertion order of a dictionary](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict). – Jeyekomon Apr 20 '21 at 13:37
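As the comments above note, types such as `datetime` need extra work. One minimal workaround (a sketch; `stable_md5` is a made-up name) is a `default=` hook that falls back to `str()`:

```python
import datetime
import hashlib
import json

def stable_md5(obj):
    """md5 over a key-sorted JSON dump; default=str catches datetimes etc."""
    payload = json.dumps(obj, sort_keys=True, default=str).encode('utf-8')
    return hashlib.md5(payload).hexdigest()

data = ['1234', 234, datetime.datetime(2013, 1, 1)]
print(stable_md5(data))  # no TypeError: the datetime is stringified
```

Note that `default=str` conflates a `datetime` with its string form, so hedge accordingly if that distinction matters for your data.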
33

bencode sorts dictionary keys, so:

import hashlib
import bencode

data = ['only', 'lists', [1,2,3],
        'dictionaries', {'a':0,'b':1}, 'numbers', 47, 'strings']
data_md5 = hashlib.md5(bencode.bencode(data)).hexdigest()
print(data_md5)

prints:

af1b88ca9fd8a3e828b40ed1b9a2cb20
Blairg23
Dan D.
  • Yes, bencode seems to do exactly the thing I imagined, but with the extra feature of being reversible. – Ned Batchelder Mar 24 '11 at 12:57
  • It should be noted that [`bencode`](https://pypi.python.org/pypi/bencode/1.0) isn't a standard Python 2 or 3 module. – martineau Nov 10 '15 at 00:31
  • It also doesn't encode as many data structures as pickle :-( – user48956 Dec 09 '16 at 01:35
  • I don't think this should be the accepted solution anymore. bencode depends on a module BTL that doesn't seem to be readily available anymore (cf. https://github.com/bittorrent/bencode/issues/3). – David Kaufman Dec 20 '20 at 22:45
  • @DavidKaufman The issue the reporter finds there is that how relative imports work changed. The module bencode consists of a directory with an `__init__.py` file that imports from the `BTL.py` file in the directory. The BTL module is right there in the bencode package. It's just that imports like that now have to use the dot-prefixed form. But there are older versions of the bencode module that are a single self-contained file. – Dan D. Dec 21 '20 at 05:37
  • @DavidKaufman If you think that, tell the asker to reconsider. I only presented an option. And it might have not been even a good one. – Dan D. Dec 21 '20 at 05:41
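For reference, the core of the format is small enough to sketch in pure Python. This is a simplified encoder, not the actual bencode package, and it assumes string dict keys:

```python
import hashlib

def bencode(v):
    """Simplified bencoding: ints, strings, lists, and dicts with string keys."""
    if isinstance(v, bool):
        raise TypeError('bencode has no boolean type')
    if isinstance(v, int):
        return b'i%de' % v                      # integers: i<value>e
    if isinstance(v, str):
        b = v.encode('utf-8')
        return b'%d:%s' % (len(b), b)           # strings: <length>:<bytes>
    if isinstance(v, list):
        return b'l' + b''.join(bencode(e) for e in v) + b'e'
    if isinstance(v, dict):
        # dicts are emitted with keys in sorted order, which is what
        # makes the encoding canonical and therefore hashable
        body = b''.join(bencode(k) + bencode(v[k]) for k in sorted(v))
        return b'd' + body + b'e'
    raise TypeError('cannot bencode %r' % (type(v),))

data = ['only', 'lists', [1, 2, 3],
        'dictionaries', {'a': 0, 'b': 1}, 'numbers', 47, 'strings']
print(hashlib.md5(bencode(data)).hexdigest())
```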
8

I ended up writing it myself as I thought I would have to:

import inspect
from hashlib import md5

class Hasher(object):
    """Hashes Python data into md5."""
    def __init__(self):
        self.md5 = md5()

    def update(self, v):
        """Add `v` to the hash, recursively if needed."""
        self.md5.update(str(type(v)).encode('utf-8'))
        if isinstance(v, str):
            self.md5.update(v.encode('utf-8'))
        elif isinstance(v, (int, float)):
            self.update(str(v))
        elif isinstance(v, (tuple, list)):
            for e in v:
                self.update(e)
        elif isinstance(v, dict):
            for k in sorted(v):
                self.update(k)
                self.update(v[k])
        else:
            for k in dir(v):
                if k.startswith('__'):
                    continue
                a = getattr(v, k)
                if inspect.isroutine(a):
                    continue
                self.update(k)
                self.update(a)

    def digest(self):
        """Retrieve the digest of the hash."""
        return self.md5.digest()
Ned Batchelder
  • How would you handle the ``set`` type? Out of my head I'd say, the same way as you handle tuples and lists, with the difference being that it should first be ``sorted()``. Would you agree? – exhuma Jul 06 '12 at 13:30
  • Yes, sorted(v) would be the way to go. – Ned Batchelder Jul 06 '12 at 16:27
  • Beware, calls to `update()` essentially concatenate. So this will hash `{'test': 1}` to the same thing as `['test', 1]` and `test1`. Which may or may not be the desired behaviour. https://docs.python.org/2/library/md5.html – Ross Hemsley Jan 12 '16 at 17:53
  • @RossHemsley You overlooked `self.md5.update(str(type(v)))` – Ned Batchelder Jan 12 '16 at 19:05
  • True, that helps. It might be worth noting that there are edge cases in doing it this way. – Ross Hemsley Jan 13 '16 at 15:29
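Following up on the set discussion in the comments, here is a condensed variant with a `sorted()` branch for sets. This is a sketch: `SetAwareHasher` is a made-up name, and the attribute-walking fallback is omitted for brevity:

```python
import hashlib

class SetAwareHasher:
    """Minimal type-dispatching md5 hasher with set support."""
    def __init__(self):
        self.md5 = hashlib.md5()

    def update(self, v):
        # mixing in the type distinguishes e.g. {'a': 1} from ['a', 1]
        self.md5.update(str(type(v)).encode('utf-8'))
        if isinstance(v, str):
            self.md5.update(v.encode('utf-8'))
        elif isinstance(v, (int, float)):
            self.update(str(v))
        elif isinstance(v, (set, frozenset)):
            for e in sorted(v):  # canonical order, per the comment thread
                self.update(e)
        elif isinstance(v, (tuple, list)):
            for e in v:
                self.update(e)
        elif isinstance(v, dict):
            for k in sorted(v):
                self.update(k)
                self.update(v[k])

    def hexdigest(self):
        return self.md5.hexdigest()

h1, h2 = SetAwareHasher(), SetAwareHasher()
h1.update({3, 1, 2})
h2.update({2, 3, 1})
print(h1.hexdigest() == h2.hexdigest())  # True
```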
6

You could use the standard-library pprint module, which covers some more cases than the proposed json.dumps() solution. For example, datetime objects will be handled correctly.

The example from the question rewritten to use pprint instead of pickle:

>>> import hashlib, random, pprint
>>> for i in range(10):
...     k = [i*i for i in range(1000)]
...     random.shuffle(k)
...     d = dict.fromkeys(k, 1)
...     print hashlib.md5(pprint.pformat(d)).hexdigest()
... 
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
b4e5de6e1c4f3c6540e962fd5b1891db
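On Python 3, the same experiment needs an explicit encode (and note that pprint sorts dict keys by default; since 3.8 this is controlled by the sort_dicts parameter):

```python
import hashlib
import pprint
import random

digests = set()
for _ in range(10):
    keys = [i * i for i in range(1000)]
    random.shuffle(keys)
    d = dict.fromkeys(keys, 1)
    # pformat sorts the keys, so every shuffle yields the same string
    digests.add(hashlib.md5(pprint.pformat(d).encode('utf-8')).hexdigest())

print(len(digests))  # 1
```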
UlfR
3

UPDATE: this won't work for dictionaries, due to key-order randomness. Sorry, I hadn't thought of that.

import hashlib
import pickle

data = ['anything', 'you', 'want']
data_pickle = pickle.dumps(data)
data_md5 = hashlib.md5(data_pickle).hexdigest()

This should work for any picklable Python data structure, including objects.
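That said, the key-order problem can be sidestepped by pickling a canonical form first, e.g. the sorted item list. A sketch that only handles flat dicts with sortable keys (`md5_of_dict` is a made-up name):

```python
import hashlib
import pickle

def md5_of_dict(d):
    """Pickle a canonical form (sorted items) instead of the dict itself."""
    return hashlib.md5(pickle.dumps(sorted(d.items()))).hexdigest()

print(md5_of_dict({'a': 1, 'b': 2}) == md5_of_dict({'b': 2, 'a': 1}))  # True
```

Bear in mind that pickle output can still vary across Python versions and protocols, so this is only stable within one interpreter version.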

rvs
2

While it does add a dependency on joblib, I've found that joblib.hashing.hash(object) works very well, and it's designed for use with joblib's disk-caching mechanism. Empirically it produces consistent results from run to run, even on data that pickle scrambles across runs.

Alternatively, you might be interested in artemis-ml's compute_fixed_hash function, which theoretically hashes objects in a way that is consistent across runs. However, I've not tested it myself.

Sorry for posting millions of years after the original question

Connor Brinton
0

A rough way: put all your structure's items into one parent entity (if they aren't already), recurse and sort/canonicalize them, then calculate the md5 of its repr().

Hossein
0

Calculating a checksum over a JSON serialization is a good idea: it's easy to implement and easy to extend to Python data structures that aren't natively JSON-serializable.

This is my revised version of @webwurst's answer, which generates the JSON string in chunks that are immediately consumed to calculate the final checksum, to prevent excessive memory consumption for a large object:

import hashlib
import json

def checksum(obj, method='md5'):
    m = hashlib.new(method)
    encoder = json.JSONEncoder(
        # don't escape Unicode chars, to save bytes
        ensure_ascii=False,

        # skip circular-reference checking, to save an in-memory mapping
        check_circular=False,

        # treat mappings with different key order as identical
        sort_keys=True,

        # drop the default spaces after separators to be more compact
        separators=(',', ':'),
    )
    for chunk in encoder.iterencode(obj):
        m.update(chunk.encode('UTF-8'))

    # use .digest() to save a few bytes if the checksum is only used
    # for internal comparison and never output
    return m.hexdigest()

data = [
    'only',
    'lists', [1,2,3],
    'dictionaries', {'a':0,'b':1},
    'numbers', 47,
    'strings', '哈囉世界'
]
chk = checksum(data)
print(chk)
Danny Lin