The documentation for the Python library Murmur is a bit sparse.
I have been trying to adapt the code from this answer:
import hashlib
from functools import partial

def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        for buf in iter(partial(f.read, 128), b''):
            d.update(buf)
    return d.hexdigest()

print(md5sum('utils.py'))
From what I read in the answer, md5 can't operate on the whole file at once, so it needs this loop. I'm not sure exactly what happens on the line d.update(buf), however.
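For what it's worth, update() feeds each chunk into the hash object's internal state, so hashing a file in pieces produces exactly the same digest as hashing all the bytes at once. A quick check:

```python
import hashlib

data = b"hello world" * 1000

# one-shot: hash everything in a single call
one_shot = hashlib.md5(data).hexdigest()

# chunked: feed 128-byte pieces through update(); the object keeps
# running state, so the final digest is identical
d = hashlib.md5()
for i in range(0, len(data), 128):
    d.update(data[i:i + 128])
chunked = d.hexdigest()

print(one_shot == chunked)  # True
```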
The public methods in hashlib.md5() are: 'block_size', 'copy', 'digest', 'digest_size', 'hexdigest', 'name', 'update', whereas mmh3 has: 'hash', 'hash64', 'hash_bytes'. There is no update or hexdigest method.
Does anyone know how to achieve a similar result?
The motivation is testing for uniqueness as fast as possible; the results here suggest murmur is a good candidate.
Update -
Following the comment from @Bakuriu I had a look at mmh3, which seems to be better documented. Its public methods are:
import mmh3
print([x for x in dir(mmh3) if x[0]!='_'])
>>> ['hash', 'hash128', 'hash64', 'hash_bytes', 'hash_from_buffer']
..so no "update" method. I had a look at the source code for mmh3.hash_from_buffer, but it does not appear to contain a loop, and since it is not written in Python I can't really follow it. Here is a link to the line
So for now I will just use CRC-32, which is supposed to be almost as good for this purpose, and it is well documented how to compute it. If anyone posts a solution I will test it out.
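For reference, here is the chunked CRC-32 version I ended up with. zlib.crc32 accepts the running checksum as its second argument, so it supports the same loop as the md5sum function above (the 65536-byte chunk size is an arbitrary choice of mine):

```python
import zlib
from functools import partial

def crc32sum(filename):
    # zlib.crc32 takes the running checksum as a second argument,
    # giving it the same incremental-update pattern as hashlib
    crc = 0
    with open(filename, mode='rb') as f:
        for buf in iter(partial(f.read, 65536), b''):
            crc = zlib.crc32(buf, crc)
    return crc & 0xFFFFFFFF  # force an unsigned 32-bit result
```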