
I know this question has been asked before, and I've seen some of the answers, but this question is more about my code and the best way of accomplishing this task.

I want to scan a directory and see if there are any duplicates (by checking MD5 hashes) in that directory. The following is my code:

import sys
import os
import hashlib

fileSliceLimitation = 5000000 #bytes

# if the file is big, use the slice trick to avoid loading the whole file into RAM
def getFileHashMD5(filename):
    retval = 0
    filesize = os.path.getsize(filename)

    if filesize > fileSliceLimitation:
        with open(filename, 'rb') as fh:
            m = hashlib.md5()
            while True:
                data = fh.read(8192)
                if not data:
                    break
                m.update(data)
            retval = m.hexdigest()
    else:
        retval = hashlib.md5(open(filename, 'rb').read()).hexdigest()

    return retval

searchdirpath = raw_input("Type directory you wish to search: ")
print ""
print ""    
text_file = open('outPut.txt', 'w')

for dirname, dirnames, filenames in os.walk(searchdirpath):
    # print path to all filenames.
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        h_md5 = getFileHashMD5(fullname)
        print h_md5 + " " + fullname
        text_file.write("\n" + h_md5 + " " + fullname)   

# close txt file
text_file.close()


print "\n\n\nReading outPut:"
text_file = open('outPut.txt', 'r')

myListOfHashes = text_file.read()

if h_md5 in myListOfHashes:
    print 'Match: ' + " " + fullname

This gives me the following output:

Please type in directory you wish to search using above syntax: /Users/bubble/Desktop/aF

033808bb457f622b05096c2f7699857v /Users/bubble/Desktop/aF/.DS_Store
409d8c1727960fddb7c8b915a76ebd35 /Users/bubble/Desktop/aF/script copy.py
409d8c1727960fddb7c8b915a76ebd25 /Users/bubble/Desktop/aF/script.py
e9289295caefef66eaf3a4dffc4fe11c /Users/bubble/Desktop/aF/simpsons.mov

Reading outPut:
Match:  /Users/bubble/Desktop/aF/simpsons.mov

My idea was:

1) Scan directory
2) Write MD5 hashes + filename to a text file
3) Open the text file as read-only
4) Scan the directory AGAIN and check against the text file...

I can see this isn't a good way of doing it, and it doesn't work: the 'match' just prints out the very last file that was processed.

How can I get this script to actually find duplicates? Can someone tell me a better/easier way of accomplishing this task?

Thank you very much for any help. Sorry this is a long post.

BubbleMonster
  • Actually, comparing MD5s is slower than comparing the files byte by byte (1 cycle per each byte of the files). I'm pretty sure MD5 takes more than one cycle per byte. I suggest you simply read a fixed number of bytes from the file and keep comparing; that will be simpler and faster. – Ramchandra Apte Sep 10 '13 at 16:42
  • @RamchandraApte, but comparing every file to every other file is slow. Using MD5s can help avoid that. – senderle Sep 10 '13 at 16:44
  • @senderle You are right. But it might be faster simply to compare the first few and the last few bytes of each file, then if it matches compare the entire file. – Ramchandra Apte Sep 10 '13 at 16:46
  • To really get the speed up, first group files by file size and then only compare within the group. Unless there is some strange reason why the files are the same size, most files will have unique file sizes and will never need a scan. – tdelaney Sep 10 '13 at 16:47

3 Answers


The obvious tool for identifying duplicates is a hash table. Unless you are working with a very large number of files, you could do something like this:

from collections import defaultdict

file_dict = defaultdict(list)
for filename in files:
    file_dict[get_file_hash(filename)].append(filename)

At the end of this process, file_dict will contain a list for every unique hash; when two files have the same hash, they'll both appear in the list for that hash. Then filter the dict looking for value lists longer than 1, and compare the files to make sure they're the same -- something like this:

for duplicates in file_dict.values():   # file_dict.itervalues() in Python 2
    if len(duplicates) > 1:
        # double-check reported duplicates and generate output

Or this:

duplicates = [files for files in file_dict.values() if len(files) > 1]

get_file_hash could use MD5s; or it could simply use the first and last bytes of the file, as Ramchandra Apte suggested in the comments above; or it could simply use file sizes, as tdelaney suggested in the comments above. Each of the latter two strategies is more likely to produce false positives, though. You could combine them to reduce the false positive rate.
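
For illustration, here is a minimal sketch of two such get_file_hash candidates: a chunked MD5 and a cheaper fingerprint built from the size plus the first and last few kibibytes. The function names and the 4 KiB endpoint size are my own assumptions for the example, not something from the question:

import os
import hashlib

def get_file_hash(filename, chunk_size=1024 * 1024):
    # chunked MD5 of the whole file: slow, but false positives are vanishingly rare
    m = hashlib.md5()
    with open(filename, 'rb') as fh:
        while True:
            data = fh.read(chunk_size)
            if not data:
                break
            m.update(data)
    return m.hexdigest()

def get_quick_fingerprint(filename, endpoint_size=4096):
    # cheap pre-filter: file size plus the first and last few KiB;
    # much faster than hashing everything, but more prone to false positives
    size = os.path.getsize(filename)
    with open(filename, 'rb') as fh:
        head = fh.read(endpoint_size)
        fh.seek(max(size - endpoint_size, 0))
        tail = fh.read(endpoint_size)
    return (size, head, tail)

Either return value works as a dictionary key, so the defaultdict(list) grouping above stays the same; you would just double-check the groups produced by the cheaper fingerprint more carefully.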

If you're working with a very large number of files, you could use a more sophisticated data structure like a Bloom Filter.
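
As a rough illustration (not part of the original answer), a Bloom filter can be sketched in a few lines of pure Python. It gives constant memory and no false negatives at the cost of a tunable false positive rate; the bit-array size, the number of hashes, and the MD5-based indexing below are arbitrary choices for the example:

import hashlib

class BloomFilter(object):
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # one big integer used as a bit array

    def _positions(self, item):
        # derive num_hashes bit positions from salted MD5 digests of the item
        for i in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (i, item)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

seen = BloomFilter()
# "file_hash in seen" may occasionally be a false positive, so any hit still
# needs to be double-checked against the real hashes before reporting a duplicate.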

senderle
  • Would I need to add that into my code? Or re-write my code to get that working? – BubbleMonster Sep 10 '13 at 17:11
  • @BubbleMonster, The above would be close to a drop-in replacement for the block starting `for filename in filenames:` in your code. (You'd have to change `files` to `filenames` of course -- and put the `from... import... ` statement at the top of your module.) Then you'd have to write a bit more code to filter out the one-item lists and test the remaining lists to make sure the files really are duplicates. That part should be fairly straightforward though. – senderle Sep 10 '13 at 17:16
  • Why do the other strategies get false positives? I tested the code and you're right. It flags up lots of false positives, but why exactly? Thank you. – BubbleMonster Sep 12 '13 at 11:42
  • @BubbleMonster well it depends on the nature of the files, but in general, there's a good chance that two files will coincidentally have the same size and starting and ending content -- especially if the files are smaller, since there are fewer possible files in that case; or if they are all of the same type, in which case they may all begin the same way. On the other hand, MD5 will almost never (but will, _very_ rarely) generate false positives, because it's a cryptographic hash -- it's specifically designed to produce results that are evenly distributed across all possible hash values. – senderle Sep 12 '13 at 12:12
  • But even MD5 generates false positives, because there are only so many hash values -- only as many as can be represented by 128 bits. So if you tried all possible _129_-bit files, at least half of those would hash to the same value as another one, because there are more possible files than hashes. – senderle Sep 12 '13 at 12:15
  • Thanks for clearing that up. I've managed to implement your suggestion, so now, when I scan a folder, I get a dictionary with the MD5 as the key and the file name as the value. How can I make the script then print out that it has found 2 (or more) of the same hashes? Also, in a python book I have, it says: [quote]'In a dictionary, the key must be a hashable object. A dictionary cannot contain duplicate keys.'[end of quote] - So it is bad that I'm actually trying to see if a key has duplicate hashes? – BubbleMonster Sep 12 '13 at 13:10
  • I don't quite understand your last question. The key _is_ the hash. You're not trying to see if a key has duplicate hashes, you're trying to see if files have duplicate hashes. All files that have the same hash are stored in a list, which is the value mapped to the key -- the hash they share. See my edits for details about how to filter the dictionary. – senderle Sep 12 '13 at 13:54

@senderle has a great answer, but since he mentioned that my solution will produce false positives, I figured the gauntlet had been laid and I'd better show some code. I thinned down your md5 function (it should always use the 'fileSliceLimitation' case and should be less stingy with its input buffer), then prefiltered by size before doing the md5s.

import sys
import os
import hashlib
from collections import defaultdict

searchdirpath = sys.argv[1]

size_map = defaultdict(list)

def getFileHashMD5(filename):
    m = hashlib.md5()
    with open(filename, 'rb', 1024*1024) as fh:
        while True:
            data = fh.read(1024*1024)
            if not data:
                break
            m.update(data)
    return m.hexdigest()

# group files by size
for dirname, dirnames, filenames in os.walk(searchdirpath):
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        size_map[os.stat(fullname).st_size].append(fullname)

# scan only groups that contain more than one file of the same size
for fullnames in size_map.itervalues():
    if len(fullnames) > 1:
        hash_map = defaultdict(list)
        for fullname in fullnames:
            hash_map[getFileHashMD5(fullname)].append(fullname)
        for dupes in hash_map.itervalues():
            if len(dupes) > 1:
                print "duplicates:"
                for fullname in dupes:
                    print "   ", fullname

(EDIT)

There were several questions about this implementation that I will try to answer here:

1) Why the (1024*1024) size and not 5000000?

Your original code read in 8192-byte (8 KiB) increments, which is very small for modern systems. You are likely to get better performance by grabbing more at once. 1024*1024 is 1048576 bytes (1 MiB) and was just a guess at a reasonable number. As for why I wrote it in such a strange way: 1000 (a decimal kilobyte) is loved by people, but 1024 (a binary kibibyte) is loved by computers and file systems. I am in the habit of writing some_number*1024 so it's easy to see that I'm referring to 1 KiB increments. 5000000 is a reasonable number too, but you should consider 5*1024*1024 (that is, 5 MiB) so that you get something that is nicely aligned for the file system.

2) What does this bit do exactly: size_map = defaultdict(list)?

It creates a defaultdict, which adds functionality to a regular dict object. A regular dict raises a KeyError exception when it is indexed by a non-existent key. defaultdict creates a default value and adds that key/value pair to the dict instead. In our case, size_map[some_size] says "give me the list of files of some_size and create a new empty list if you don't have one".

The line size_map[os.stat(fullname).st_size].append(fullname) breaks down to:

stat = os.stat(fullname)
size = stat.st_size
filelist = size_map[size]    # this is the same as:
                             #    if size not in size_map:
                             #        size_map[size] = list()
                             #    filelist = size_map[size]
filelist.append(fullname)
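
A tiny demonstration of the difference (my own example, not part of the original answer):

from collections import defaultdict

plain = {}
# plain[1024]                     # a regular dict raises KeyError here

size_map = defaultdict(list)
size_map[1024].append('a.txt')    # missing key 1024 is first created as an empty list
size_map[1024].append('b.txt')
print(size_map[1024])             # ['a.txt', 'b.txt']
print(size_map[4096])             # [] -- even a plain lookup inserts the default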

3) sys.argv[1]: I'm guessing sys.argv[1] just makes the python py.py 'filepath' argument work (where filepath is argv[1])?

Yes, when you call a python script, sys.argv[0] is the name of the script and sys.argv[1:] (arg 1 and following) are any additional arguments given on the command line. I used sys.argv[1] as a quick way to test the script when I wrote it and you should change that to meet your needs.
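
For example (just a sketch, not part of the original answer), you could keep both behaviours by falling back to the prompt when no argument is given:

import sys

if len(sys.argv) > 1:
    searchdirpath = sys.argv[1]    # directory passed on the command line
else:
    # no argument given: prompt for it, as the original script did
    searchdirpath = raw_input("Type directory you wish to search: ")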

tdelaney
  • Thanks for the code. I just tried running it and it said: File "py.py", line 6, in searchdirpath = sys.argv[1] IndexError: list index out of range. Do you know how to fix that? Should I have left in my searchdirpath raw_input? – BubbleMonster Sep 10 '13 at 17:32
  • @BubbleMonster - I wrote it assuming it would be called like `py.py path\to\directory` (Windows) or `python py.py path/to/directory` (linux). You can change it to suit your needs. – tdelaney Sep 10 '13 at 17:34

The first thing you're going to want to do is save the h_md5's to a list as you loop through your files. Something like:

h_md5 = []

before you loop through your directory. And

h_md5.append(getFileHashMD5(fullname))

inside your loop. Now you have a list of hashes to compare with your output file instead of simply the last one you made in your loop.

Also, obviously, with your current code you are going to find one match for each file every time, because you will find the hash for that particular file itself in your list. So if you want to look for duplicates, you are going to have to look for instances where two distinct matches are found.
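
A minimal sketch of that idea (the Counter and the variable names are my own additions, not code from this answer): collect (hash, filename) pairs inside the os.walk loop, then report every hash that occurs more than once.

from collections import Counter

hash_list = []
# inside the os.walk loop:
#     hash_list.append((h_md5, fullname))

# after the walk: count how many files share each hash
counts = Counter(h for h, name in hash_list)
for h, name in hash_list:
    if counts[h] > 1:
        print h + " " + name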

edit: the answer above by @senderle is a much better way to do this if you are willing to change your code.

derricw
  • Thanks for the answer. I thought it would be something like that. I get the following error when I added your code: AttributeError: 'str' object has no attribute 'append' – BubbleMonster Sep 10 '13 at 17:13
  • @BubbleMonster, yes that's probably because you are overwriting your list with a string. In my suggestion, rename h_md5 to h_md5List or something and that should solve it. – derricw Sep 10 '13 at 17:21