
I am trying to avoid duplicates in my mp3 collection (quite large). I want to check for duplicates by comparing file contents instead of looking for the same file name. I have written the code below to do this, but it throws a MemoryError after about a minute. Any suggestions on how I can get this to work?

import os
import hashlib

walk = os.walk('H:\MUSIC NEXT GEN')

mySet = set()
dupe  = []

hasher = hashlib.md5()

for dirpath, subdirs, files in walk:
    for f in files:
        fileName =  os.path.join(dirpath, f)
        with open(fileName, 'rb') as mp3:
            buf = mp3.read()
            hasher.update(buf)
            hashKey = hasher.hexdigest()
            print hashKey
            if hashKey in mySet:
                dupe.append(fileName)
            else:
                mySet.add(hashKey)


print 'Dupes: ' + str(dupe)
chepner

2 Answers


You probably have a huge file that can't be read all at once, as you try to do with mp3.read(). Read it in smaller chunks instead. Putting the hashing into a nice little function also helps keep your main program clean. Here's a function I've been using myself for a while now (just slightly polished) for a tool probably similar to yours:

import hashlib

def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()           # fresh hasher for each file
        while True:
            buffer = file.read(1 << 20)  # read in 1 MiB chunks to keep memory use flat
            if not buffer:               # empty read means end of file
                return hasher.hexdigest()
            hasher.update(buffer)
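
Plugged into a directory walk like the one in the question, it could be used along these lines (a sketch for illustration; the path comes from the question, the variable names are mine):

import os

seen = {}      # hexdigest -> first file seen with that content
dupes = []     # later files whose content matched an earlier one

for dirpath, subdirs, files in os.walk(r'H:\MUSIC NEXT GEN'):
    for f in files:
        path = os.path.join(dirpath, f)
        key = filehash(path)           # hash the file's contents in chunks
        if key in seen:
            dupes.append((path, seen[key]))
        else:
            seen[key] = path

print('Dupes: ' + str(dupes))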

Update: A readinto version:

buffer = bytearray(1 << 20)              # reusable 1 MiB buffer, shared across calls
def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            n = file.readinto(buffer)    # fill the buffer in place; n = bytes actually read
            if not n:
                return hasher.hexdigest()
            hasher.update(buffer if n == len(buffer) else buffer[:n])  # hash only the bytes read

With a 1 GB file already cached in memory and ten attempts, this took on average 5.35 seconds. The read version took on average 6.07 seconds. In both versions, the Python process occupied about 10 MB of RAM during the run.

I'll probably stick with the read version, as I prefer its simplicity, and because in my real use cases the data isn't already cached in RAM and I use sha256 (so the overall time goes up significantly, which makes readinto's small advantage even less relevant).
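
For reference, a comparison along those lines can be timed with something like the following (just a sketch: the test file path is hypothetical, and the absolute numbers depend entirely on the machine and on whether the file is cached):

import timeit

TEST_FILE = r'H:\bigfile.bin'   # hypothetical ~1 GB test file; use your own

# filehash must already be defined (either the read or the readinto version above)
elapsed = timeit.timeit(lambda: filehash(TEST_FILE), number=10)
print('average per run: %.2f s' % (elapsed / 10))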

Stefan Pochmann
  • Note also that this fixes the OP's other bug, where a single hasher is updated instead of a new one being created for each file. – tdelaney May 05 '15 at 22:34

hasher.update appends the content to the previously hashed content, so your digest ends up covering all files combined. You may want to create a new hasher for each file.
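
A minimal sketch of that fix applied to the loop from the question (the path comes from the question; note this still reads each file whole, so see the other answer for chunked reading):

import hashlib
import os

seen = set()
dupes = []

for dirpath, subdirs, files in os.walk(r'H:\MUSIC NEXT GEN'):
    for f in files:
        path = os.path.join(dirpath, f)
        hasher = hashlib.md5()          # fresh hasher per file, so digests aren't mixed
        with open(path, 'rb') as mp3:
            hasher.update(mp3.read())   # whole-file read; chunk it for very large files
        key = hasher.hexdigest()
        if key in seen:
            dupes.append(path)
        else:
            seen.add(key)

print('Dupes: ' + str(dupes))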

Eric Renouf
  • A great observation, but it doesn't answer the question. – tdelaney May 05 '15 at 22:12
  • Why not? The hasher is accumulating the contents of all the files and concatenating them together, which is causing the out-of-memory error. If he resets the hasher each time, that won't happen and the memory problem will be solved. – Eric Renouf May 05 '15 at 22:13
  • @tdelaney -- it perfectly answers the question. The problem is, he is reading the entire contents of every file into the hasher. This is both causing the memory overflow and stopping the process from working. – Michael Lorton May 05 '15 at 22:14
  • No, the hasher keeps a fixed-size hash and discards hashed content. – tdelaney May 05 '15 at 22:15
  • @tdelaney I'm confused about the documentation then. It says: "Update the hash object with the string arg. Repeated calls are equivalent to a single call with the concatenation of all the arguments: m.update(a); m.update(b) is equivalent to m.update(a+b)." Can you point me to where I can find the size limit of the hasher? – Eric Renouf May 05 '15 at 22:20
  • OP, tdelaney suggests that the hasher does not maintain the file contents as part of its state. Could you have one ultra-large file somewhere? If so, look at Stefan Pochmann's suggestion. – Michael Lorton May 05 '15 at 22:26
  • @EricRenouf - the job of a hasher is to condense a large blob of data into a small hash. It does this by shifting around small chunks of data, blending them into the existing hash, discarding the data and repeating with the next chunk. – tdelaney May 05 '15 at 22:27
  • It's like I give you one number every day for a week and then ask you for their sum. You don't need to remember all the numbers I gave you, only the running sum. – Stefan Pochmann May 05 '15 at 22:41
  • That is identical to `x = 0; x += a; x += b` versus `x = a + b`. No need to store `a` until `b` is available. Just think of a hash as a checksum. But the problem with the accumulation across all files persists (see the short demonstration after this thread). – too honest for this site May 05 '15 at 22:58
  • Thanks for the very useful analogies and explanations, tdelaney, Stefan Pochmann and Olaf. – Eric Renouf May 05 '15 at 23:10
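
To illustrate the equivalence quoted from the docs in this thread, that several update calls hash the same bytes as a single call with the concatenation, here is a quick sketch:

import hashlib

a, b = b'hello ', b'world'

m1 = hashlib.md5()
m1.update(a)        # incremental: two separate calls
m1.update(b)

m2 = hashlib.md5()
m2.update(a + b)    # single call with the concatenation

assert m1.hexdigest() == m2.hexdigest()   # same digest either way
print(m1.hexdigest())

The hasher only carries its small fixed-size state between calls, so, as the answers above point out, the MemoryError comes from mp3.read() loading an entire file at once, not from the hasher accumulating data.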