
There is some good discussion on Stack Overflow about computing hashes of files in Python, including files too large to fit in memory (1, 2, 3). Those threads end up with solutions that look something like this (lightly edited from #3):

import hashlib

def md5_file(path, size):
    """Hash a file in chunks of `size` bytes so it never has to fit in memory."""
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)       # feed each chunk into the running digest
            b = f.read(size)
    return m.digest()
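
Called with, say, a 64 KiB block size (roughly what #3 reports as the page-swap sweet spot on that poster's machine), it would be used like this; the path is just a placeholder:

digest = md5_file('/path/to/some/large/file', 64 * 1024)  # placeholder path, 64 KiB chunks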

If you don't need your hash function to be cryptographically secure (which I don't), there's pyfasthash (seemingly a.k.a. pyhash), as discussed here. Unfortunately, pyfasthash's hash classes lack the `update` method used above, and I haven't had much luck figuring out what else to do; its mixture of Python and C code is beyond me. For now I'm just reading the whole file into memory, like this:

with open(path, 'rb') as afile:
    return hasher(afile.read())  # hasher is a pyfasthash hash object; the whole file is read at once

The disadvantages of this approach are:

  1. The file has to fit in memory
  2. It's slower. According to #3, the amount of the file you load into memory at once should be small enough to avoid page swaps (about 64 KiB on that poster's machine).

Is there any way I can calculate hashes of my files more quickly?

kuzzooroo
  • I don't know about the Python implementation, but cryptographically secure hashes such as SHA1 (yeah I know) or SHA256 on modern machines may be faster, as modern processors provide (big parts of) them in hardware. – Matteo Italia Dec 26 '18 at 00:16
  • That being said, in general if you need to hash big files the speed of the hashing algorithm isn't going to be particularly important (as long as it's something that is reasonably fast) - IO is going to be slower anyway. – Matteo Italia Dec 26 '18 at 00:18
  • Thanks, @MatteoItalia. The pyfasthash algorithms are faster than SHA1, although many of them do cluster in runtime as if they are IO-bound. If the page swaps are out of band with respect to the IO then this clever buffering may indeed not matter much for files that fit in memory. – kuzzooroo Dec 26 '18 at 01:42
  • In the notes for pyhash they show examples of hashing by pieces. – President James K. Polk Dec 27 '18 at 01:37
  • @JamesKPolk, are you referring to "`hasher('hello', ' ', 'world')` is a syntax sugar for `hasher('world', seed=hasher(' ', seed=hasher('hello')))`" or am I missing something else? I was scared off by the object construction and the warning that neither may equal `hasher('hello world')`, although the latter might be survivable if I make sure always to use the same chunk size. – kuzzooroo Dec 27 '18 at 05:54
  • Yes, I'm referring to that. I admit it's pretty crappy and the warning notes do mean you need to do some testing. You don't seem to have many good choices. I would put in a feature request to have them provide an API that efficiently supports the hashlib API, but it's open source so you'd probably get faster results just forking the repo and doing it yourself. – President James K. Polk Dec 27 '18 at 13:41
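
Based on that seed-chaining sugar, a chunked reader for pyhash might look roughly like the sketch below. This is an untested sketch, not a recipe from the pyhash docs: hash_file_chunked is a made-up name, fnv1_64 is just one of the available hashers, and, per the warning quoted above, the result need not match hashing the whole file in one call and is only reproducible if the chunk size is kept fixed.

import pyhash

def hash_file_chunked(path, size=64 * 1024):
    hasher = pyhash.fnv1_64()  # illustrative choice of non-cryptographic hasher
    h = None
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            # feed the previous result back in as the seed, mirroring
            # hasher('world', seed=hasher(' ', seed=hasher('hello')))
            h = hasher(b) if h is None else hasher(b, seed=h)
            b = f.read(size)
    return h  # None for an empty file; handle that case as needed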

0 Answers