
I'm experimenting with different ways to identify duplicate files, based on file content, by looping through the top-level directory, where folders A-Z exist. Within folders A-Z there is one additional layer of folders named after the current date. Finally, within the dated folders there are between several thousand and several million (<3 million) files in various formats.

Using the script below I was able to process roughly 800,000 files in about 4 hours. However, when running it over a larger data set of roughly 13,000,000 files in total, it consistently breaks on the letter "I", which contains roughly 1.5 million files.

Given the size of the data I'm dealing with, I'm considering writing the output directly to a text file and then importing it into MySQL or something similar for further processing (a rough sketch of that approach follows the script below). Please let me know if I'm going down the right track, or if you feel a modified version of the script below should be able to handle 13+ million files.

Question - How can I modify the script below to handle 13+ million files?

Error traceback:

Traceback (most recent call last):
  File "C:/Users/"user"/PycharmProjects/untitled/dups.py", line 28, in <module>
    for subdir, dirs, files in os.walk(path):
  File "C:\Python34\lib\os.py", line 379, in walk
    yield from walk(new_path, topdown, onerror, followlinks)
  File "C:\Python34\lib\os.py", line 372, in walk
    nondirs.append(name)
MemoryError

My code:

import hashlib
import os
import datetime
from collections import defaultdict


def hash(filepath):
    hash = hashlib.md5()
    blockSize = 65536
    with open(filepath, 'rb') as fpath:
        block = fpath.read(blockSize)
        while len(block) > 0:
            hash.update(block)
            block = fpath.read(blockSize)
    return hash.hexdigest()


directory = "\\\\path\\to\\files\\"
directories = [name for name in os.listdir(directory) if os.path.isdir(os.path.join(directory, name))]
outFile = open("\\path\\output.txt", "w", encoding='utf8')

for folder in directories:
    sizeList = defaultdict(list)
    path = directory + folder
    print("Start time: " + str(datetime.datetime.now()))
    print("Working on folder: " + folder)
    # Walk through the dated subfolders under this letter folder
    for subdir, dirs, files in os.walk(path):
        for file in files:
            filePath = os.path.join(subdir, file)
            sizeList[os.stat(filePath).st_size].append(filePath)
    print("Hashing " + str(len(sizeList)) + " Files")
    ## Hash remaining files
    fileList = defaultdict(list)
    for fileSize in sizeList.values():
        if len(fileSize) > 1:
            for dupSize in fileSize:
                fileList[hash(dupSize)].append(dupSize)
    ## Write remaining hashed files to file
    print("Writing Output")
    for fileHash in fileList.values():
        if len(fileHash) > 1:
            for hashOut in fileHash:
                outFile.write(hashOut + " ~ " + str(os.stat(hashOut).st_size) + '\n')
            outFile.write('\n')
outFile.close()
print("End time: " + str(datetime.datetime.now()))
  • What are you asking? If you want code review [go here](http://codereview.stackexchange.com/). If you're asking how to solve a specific problem you're trying to complete (you haven't), please make it clearer. – yuvi Sep 21 '15 at 21:22
  • Are you trying to find duplicates by filename? If so, you could try what you said of adding each filename to a database and then checking if it exists before insert, output dupe otherwise. Another method could be to write out all the filenames to csv and use a library like dask which will chunk up your task. – postelrich Sep 21 '15 at 21:25
  • I am attempting to find duplicates based on file content. – user3582260 Sep 21 '15 at 21:26
  • Stack Overflow is a question-and-answer site. It is not a problem-and-resolution site. Your post is rich in detail and precisely describes your situation, but it is still missing the essential ingredient: a question. What are you trying to ask? – Robᵩ Sep 21 '15 at 21:30
  • Thanks for editing your question in. Have you tried a third party library to make hashing faster maybe? Googling I found [this](https://github.com/Cyan4973/xxHash). Also, you should benchmark different ways to loop (but take 10K files or something smaller so you don't have to run 4 hours testing every time). list comprehensions can [sometimes be faster](http://stackoverflow.com/questions/22108488/are-list-comprehensions-and-functional-functions-faster-than-for-loops) than normal for loops – yuvi Sep 21 '15 at 21:30
  • Hi Rob - Please see the section starting with "Question". My primary objective is to hash the content of 13+ million files to find duplicates. The faster the better, but I would rather have accurate results. My current script breaks with the error above. – user3582260 Sep 21 '15 at 21:36
  • Oops, thought you were using Unix/Linux. If you were, you could use this shell one-liner: `find . -type f -print0 | xargs -0 md5sum > md5 && sort md5 | uniq -D -w32 > dupes`. This will store all of your md5 values in the file `./md5`, and list all duplicated files in `dupes`. – Robᵩ Sep 21 '15 at 21:40
  • Yuvi - Thank you for your response. I'm not too concerned with speed, but I would prefer to keep the entire process contained within Python. The server has 64 GB of RAM, so I'm having a hard time understanding the "memory error". – user3582260 Sep 21 '15 at 21:41
  • Rob - I have roughly 80+ TB that will need to be processed and currently resides on several Windows machines. I would prefer to keep it on the Windows side. – user3582260 Sep 21 '15 at 21:42
  • Yep, I just noticed that you were using Windows. I thought it was Unix/Linux. Never mind. – Robᵩ Sep 21 '15 at 21:43

1 Answer


Disclaimer: I don't know if this is a solution.

I looked at your code, and I realized the error is provoked by .walk. Now it's true that this might be because of too much info being processed (so maybe an external DB would help matters, though the added operations might hinder your speed). But other than that, .listdir (which is called by .walk) performs really badly when you handle a huge number of files. Hopefully this is resolved in Python 3.5, because it implements the much better scandir, so if you're willing* to try the latest (and I do mean latest, it was released, what, 8 days ago?), that might help.

Other than that, you can try to trace bottlenecks and tune garbage collection to maybe figure it out.

*You can also just install it with pip using your current Python, but where's the fun in that?
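
To make the scandir suggestion concrete, here is a minimal sketch of a walker that yields one file at a time instead of building whole-directory name lists the way os.walk/os.listdir do. It assumes Python 3.5's os.scandir (on 3.4 the pip scandir package exposes the same interface via from scandir import scandir); the iter_files helper and the example root path are illustrative, not part of the original script.

import os


def iter_files(root):
    # Yield (path, size) for every regular file under root, one entry
    # at a time, using an explicit stack instead of per-directory lists.
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in os.scandir(current):  # os.scandir is new in 3.5
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                # On Windows the stat info comes from the directory scan,
                # so this usually avoids an extra system call per file.
                yield entry.path, entry.stat(follow_symlinks=False).st_size


# Example: feed the size-grouping step incrementally.
# for path, size in iter_files(r"\\path\to\files\A"):
#     size_groups[size].append(path)
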

– yuvi