
I am currently doing a project where I turn my Pi (Model 4, 2 GB) into a sort of NAS archive. I decided to learn a bit of Python along the way and wrote a small console app to "manage" my database. One function I added hashes the files in the database so the app knows when files are corrupted. To achieve this I hash a file like this:

from hashlib import sha256

with open(file, "rb") as f:
    rbytes = f.read()  # reads the ENTIRE file into memory at once
    readable_hash = sha256(rbytes).hexdigest()

Now when I run this on smaller files it works just fine, but on large files like videos it spits out a MemoryError - I presume this is because there isn't enough RAM to hold the whole file at once?

I've seen that you can break the read up into chunks, but does this also work for hashing? If so, how?

Also I'm not a programmer. I want to learn in the process, so the simpler the solution the better - I want to actually understand the code I use. :) Doesn't need to be a super fast algorithm that squeezes out every millisecond either, as long as it gets the job done.

Thanks for any help in advance!

Jamarley
  • TL;DR -- yes, the hash library routines have a method called `.update` that can be used to feed stuff to them bit by bit. – Tim Roberts May 03 '22 at 18:46
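
For reference, a minimal sketch of what the comment describes: one `sha256` object fed chunk by chunk via `.update()`, so only one chunk is ever in memory at a time. The function name and the 65536-byte chunk size are just illustrative choices, not anything from the thread.

from hashlib import sha256

def sha256_of_file(path, chunk_size=65536):
    h = sha256()  # one hash object accumulates the whole file
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # at most chunk_size bytes held in memory
            if not chunk:  # an empty bytes object means end of file
                break
            h.update(chunk)  # feed this chunk into the running hash
    return h.hexdigest()

print(sha256_of_file("file.txt"))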

1 Answer


One solution is to hash each part of the file together with the hash of everything read so far: the hash at the end still depends on the whole file, there are just a few extra steps.

import hashlib

def hash_file(filename, chunk_size):
    hashed = b""  # running hash, starts empty
    with open(filename, 'rb') as f:  # read from the file in binary mode
        while True:  # read chunk_size bytes at a time until the loop breaks
            chunk = f.read(chunk_size)  # read the next chunk of bytes
            if chunk:  # as long as "chunk" is not empty
                # hash the old hash together with the newly read chunk
                hashed = hashlib.md5(hashed + chunk).digest()
            else:
                break  # end of file reached, stop the loop
    return hashed.hex()

print(hash_file('file.txt', 1000))

By hashing the contents over and over again, we always create a value that originates from the previous hash. Because an MD5 digest always has the same fixed size, the data held in memory stays small no matter how large the file is, while the result still depends on the whole file.

PS: the chunk_size argument can be anything, but more bytes per chunk means more memory, while fewer bytes means more loop iterations and a longer compute time; try what fits your needs. 1000–9000 seems to be a good spot.
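
If you want to see that trade-off yourself, here is a rough timing sketch using the hash_file above; the file name and the chunk sizes are just examples to adapt to your own files.

import time

# hypothetical quick comparison of chunk sizes on one file
for size in (1000, 4096, 9000):
    start = time.perf_counter()
    hash_file('file.txt', size)
    elapsed = time.perf_counter() - start
    print(f"{size} bytes per chunk: {elapsed:.3f} s")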

urwolfiii