I know this question has been asked before, and I've seen some of the answers, but this question is more about my code and the best way of accomplishing this task.
I want to scan a directory and see if there are any duplicates (by checking MD5 hashes) in that directory. The following is my code:
import sys
import os
import hashlib

fileSliceLimitation = 5000000  # bytes

# if the file is big, read it in slices to avoid loading the whole file into RAM
def getFileHashMD5(filename):
    retval = 0
    filesize = os.path.getsize(filename)
    if filesize > fileSliceLimitation:
        with open(filename, 'rb') as fh:
            m = hashlib.md5()
            while True:
                data = fh.read(8192)
                if not data:
                    break
                m.update(data)
            retval = m.hexdigest()
    else:
        retval = hashlib.md5(open(filename, 'rb').read()).hexdigest()
    return retval
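(As an aside, I suspect the chunked read would produce the same digest for small files too, so the size check may be unnecessary; if so, the function could shrink to something like this untested sketch:)

    def getFileHashMD5(filename):
        # stream the file in 8 KB chunks; should work for any file size
        m = hashlib.md5()
        with open(filename, 'rb') as fh:
            while True:
                data = fh.read(8192)
                if not data:
                    break
                m.update(data)
        return m.hexdigest()

Anyway, continuing with my actual script: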
searchdirpath = raw_input("Type directory you wish to search: ")
print ""
print ""

text_file = open('outPut.txt', 'w')

for dirname, dirnames, filenames in os.walk(searchdirpath):
    # print path to all filenames.
    for filename in filenames:
        fullname = os.path.join(dirname, filename)
        h_md5 = getFileHashMD5(fullname)
        print h_md5 + " " + fullname
        text_file.write("\n" + h_md5 + " " + fullname)

# close txt file
text_file.close()

print "\n\n\nReading outPut:"
text_file = open('outPut.txt', 'r')
myListOfHashes = text_file.read()

if h_md5 in myListOfHashes:
    print 'Match: ' + " " + fullname
This gives me the following output:
Please type in directory you wish to search using above syntax: /Users/bubble/Desktop/aF
033808bb457f622b05096c2f7699857v /Users/bubble/Desktop/aF/.DS_Store
409d8c1727960fddb7c8b915a76ebd35 /Users/bubble/Desktop/aF/script copy.py
409d8c1727960fddb7c8b915a76ebd25 /Users/bubble/Desktop/aF/script.py
e9289295caefef66eaf3a4dffc4fe11c /Users/bubble/Desktop/aF/simpsons.mov
Reading outPut:
Match: /Users/bubble/Desktop/aF/simpsons.mov
My idea was:

1) Scan the directory
2) Write the MD5 hash + filename of each file to a text file
3) Open the text file as read-only
4) Scan the directory AGAIN and check against the text file...
I can see that this isn't a good way of doing it, and it doesn't work either: the 'Match' line just prints out the very last file that was processed.
How can I get this script to actually find duplicates? Can someone tell me a better/easier way of accomplishing this task? Something along the lines of the sketch below is what I'm imagining, but I'm not sure it's right.
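Here's a rough, untested sketch of the dictionary-based approach I have in mind (it reuses my getFileHashMD5 function from above and collects paths in a dict keyed by hash):

    hashes = {}
    for dirname, dirnames, filenames in os.walk(searchdirpath):
        for filename in filenames:
            fullname = os.path.join(dirname, filename)
            h_md5 = getFileHashMD5(fullname)
            # group every path under the hash it produced
            hashes.setdefault(h_md5, []).append(fullname)

    # any hash that maps to more than one path is a set of duplicates
    for h_md5, paths in hashes.items():
        if len(paths) > 1:
            print 'Duplicates (' + h_md5 + '):'
            for p in paths:
                print '    ' + p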
Thank you very much for any help. Sorry this is a long post.