
The documentation for the Python library Murmur is a bit sparse.

I have been trying to adapt the code from this answer:

import hashlib
from functools import partial

def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
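        # read the file in 128-byte chunks until read() returns b'' at EOF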
        for buf in iter(partial(f.read, 128), b''):
            d.update(buf)
    return d.hexdigest()

print(md5sum('utils.py'))

From what I read in the answer, MD5 can't operate on the whole file at once, so it needs this looping. I was not sure exactly what happens on the line d.update(buf), though.
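
From experimenting a little, update() seems to just feed more bytes into the running hash state, so hashing in chunks gives the same digest as hashing everything in one go:

import hashlib

data = b'some bytes to hash' * 1000

# one-shot hashing
whole = hashlib.md5(data).hexdigest()

# incremental hashing in 128-byte chunks
d = hashlib.md5()
for i in range(0, len(data), 128):
    d.update(data[i:i + 128])

print(whole == d.hexdigest())  # prints True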

The public methods in hashlib.md5() are:

 'block_size',
 'copy',
 'digest',
 'digest_size',
 'hexdigest',
 'name',
 'update'

whereas mmh3 has

'hash',
'hash64',
'hash_bytes'

No update or hexdigest methods.

Does anyone know how to achieve a similar result?

The motivation is testing for uniqueness as fast as possible; the results here suggest murmur is a good candidate.

Update -

Following the comment from @Bakuriu, I had a look at mmh3, which seems to be better documented.

The public methods inside it are:

import mmh3
print([x for x in dir(mmh3) if x[0] != '_'])
# ['hash', 'hash128', 'hash64', 'hash_bytes', 'hash_from_buffer']

..so no "update" method. I had a look at the source code for mmh3.hash_from_buffer, but it does not look like it contains a loop, and since it is not in Python I can't really follow it. Here is a link to the line.
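
For reference, the one-shot functions are easy enough to use, there is just no way to feed data in piece by piece:

import mmh3

data = b'some bytes'

print(mmh3.hash(data))        # 32-bit signed int
print(mmh3.hash64(data))      # tuple of two 64-bit signed ints
print(mmh3.hash128(data))     # one 128-bit int
print(mmh3.hash_bytes(data))  # 16 raw bytes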

So for now I will just use CRC-32, which is supposed to be almost as good for the purpose, and it is well documented how to do it incrementally. If anyone posts a solution I will test it out.
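
Something along these lines (my own adaptation of the md5sum function above, using the 8 KB chunk size suggested in the comments) should do a chunked CRC-32:

import zlib
from functools import partial

def crc32sum(filename):
    # zlib.crc32 accepts the running checksum as its second argument,
    # so the value can be built up chunk by chunk
    crc = 0
    with open(filename, mode='rb') as f:
        for buf in iter(partial(f.read, 8192), b''):
            crc = zlib.crc32(buf, crc)
    return crc & 0xFFFFFFFF  # force an unsigned result on old Python versions

print(crc32sum('utils.py'))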

cardamom
  • *"Not sure exactly what would happen on the line `d.update(buf)`"* - The loop reads 128-byte chunks of the file and updates the hash with them. Hashes can be calculated incrementally - keep updating with new bytes until you're done. This works well with streaming data (think web requests) or when you don't want to load an entire file into RAM before hashing it. 128 bytes is a very small chunk though, which is not optimal for performance. Chunk sizes of 8 KB or more are likely to produce better results. – Tomalak Oct 08 '18 at 16:19
  • The problem is that you are using a random module from 2009, which probably does not follow the "standard API" and may not have feature parity with it. I'd use something like [`murmurhash`](https://pypi.org/project/murmurhash/) or [`mmh3`](https://pypi.org/project/mmh3/) instead. – Bakuriu Oct 08 '18 at 16:31
  • Thanks for the explanation. I am not sure, though, that every one of these has an "update" method, and if it is not in the library, you can't really do anything. @Bakuriu, I had to uninstall that random module from 2009 as it has the same name as the mmh3 you linked! From what I can tell, that one is better documented than the other one. – cardamom Oct 09 '18 at 09:17

1 Answer


To hash a file using murmur, one has to load it completely into memory and hash it in one go.

import mmh3

# read in binary mode so non-text files hash correctly
with open('main.py', 'rb') as file:
    data = file.read()

# hash_bytes returns the raw 16-byte murmur3 digest; 0xBEFFE is an arbitrary seed
digest = mmh3.hash_bytes(data, 0xBEFFE)
print(digest.hex())

If your file is too large to fit into memory, you could use incremental/progressive hashing: add your data in multiple chunks and hash them on the fly (like your example above).
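
As a workaround you could roll your own chunked scheme, for example by folding the previous digest into the next chunk. A sketch (my own construction, not a standard API; the result will not match hashing the whole file in one go, but it is deterministic and depends on every byte, which is enough for uniqueness testing):

import mmh3
from functools import partial

def mmh3_chunked(filename, chunk_size=8192):
    digest = b''
    with open(filename, 'rb') as f:
        for buf in iter(partial(f.read, chunk_size), b''):
            # prepend the previous 16-byte digest so every chunk
            # influences the final value
            digest = mmh3.hash_bytes(digest + buf)
    return digest.hex()

print(mmh3_chunked('main.py'))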

Is there a Python library for progressive hashing with murmur?
I tried to find one, but it seems there is none.

Is progressive hashing even possible with murmur?
There is a working implementation in C:

pscheid
  • If you think this question has an answer somewhere else on this site - [flag it as duplicate](https://stackoverflow.com/help/privileges/flag-posts) instead of posting a link to an answer as an answer... – Tomerikoo Jan 06 '21 at 12:14
  • If the links are dead on the answer you link to, you can always propose an edit to update them. That would be helpful to future readers who might want the links. – EJoshuaS - Stand with Ukraine Jan 06 '21 at 14:54
  • Thank you all very much for your feedback. This question is not a duplicate of the other question, but they are clearly related. I edited my answer to focus more on this question. I tried to edit the links in the linked answer, but was not able to. It seems in the meantime a moderator edited the other question. The answer I linked to is now removed and a comment shows the working links again. – pscheid Jan 06 '21 at 15:15
  • reading the file in binary mode with `open('main.py', 'rb')` works too. I'm using this in [murmurhash-cli-python](https://github.com/milahu/murmurhash-cli-python) – milahu Oct 31 '21 at 16:58