I suggest that you compute a "hash" from the contents of each file. Make a dictionary where the keys are the hash values, and the values are lists of filenames.
The Python hashlib module has multiple hash algorithms you could use. I suggest SHA-1 or MD5.
It is very, very unlikely for two files that are not identical to have the same hash value. If you wanted to make absolutely sure, you could loop over each list of files that share a hash and compare the actual file contents byte for byte to verify that they really are the same.
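If you want that extra check, the standard library's filecmp module can do the byte-for-byte comparison for you (file_a and file_b here are hypothetical filenames, just for illustration):

import filecmp

# shallow=False forces a byte-for-byte comparison instead of just
# comparing the two files' os.stat() signatures.
if filecmp.cmp(file_a, file_b, shallow=False):
    print(file_a, "and", file_b, "have identical contents")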
You can use defaultdict to make this even easier: How does collections.defaultdict work?
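In short, a defaultdict(list) creates an empty list automatically the first time you touch a missing key, so you never need to check whether the key already exists:

from collections import defaultdict

d = defaultdict(list)
d["missing"].append("a.txt")  # no KeyError; an empty list is created first
print(d["missing"])           # ['a.txt']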
This is just untested pseudocode, but do something like:
from collections import defaultdict
import hashlib
import os

h = defaultdict(list)

# Build the list of filenames however you like; here, the plain
# files in the current directory.
list_of_files_in_directory = [f for f in os.listdir(".") if os.path.isfile(f)]

for filename in list_of_files_in_directory:
    # Binary mode: hashlib needs bytes, and this handles non-text files too.
    with open(filename, "rb") as f:
        data = f.read()
    fhash = hashlib.sha1(data).hexdigest()
    h[fhash].append(filename)

# h now maps each unique content hash to a list of filenames with that hash.
Your dictionary could just use the raw binary hash values as keys, but it is more convenient to use strings. The .hexdigest() method gives you a string representing the hash as hexadecimal digits.
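For example, both of these hash the same bytes; only the form of the result differs:

import hashlib

h = hashlib.sha1(b"hello")
h.digest()     # raw bytes: b'\xaa\xf4\xc6\x1d...'
h.hexdigest()  # string: 'aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d'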
EDIT: In a comment, @parchment suggests using os.stat() to get the file size, and only computing the file hash when two or more files have the same size. This is an excellent way to speed up finding identical files: if only one file has a particular size, you know it cannot be identical to any of the other files. And if the files are large, computing the hashes might be slow.
But I would suggest writing the simple hashing code first and getting it working; then, if you have time, try rewriting it to check the file size. Code that checks the file size and only sometimes hashes the file will be more complicated, and thus more difficult to get right.
Off the top of my head, here is how I would rewrite it to use the file sizes (a sketch follows the list):

- Make an empty list called done. This is where you will store your output (lists of filenames whose contents are identical).
- Make a dictionary mapping file size to a list of filenames. You can use defaultdict as shown above.
- Loop over the dictionary. For each value that is a list containing a single filename, just append that value to the done list; a unique size means a unique file. For each value that is a list of two or more filenames, compute the hashes and build another dictionary mapping hashes to lists of files with that hash; when that is done, loop over all the values in this dictionary and add them to done. Basically this part is the same code as the solution that hashes all files; it's just that now you only need to hash the files with non-unique sizes.
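Here is a minimal, untested sketch of that approach, reusing list_of_files_in_directory and the hashing idea from above:

from collections import defaultdict
import hashlib
import os

def hash_file(filename):
    # Binary mode again: hashlib needs bytes.
    with open(filename, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

done = []  # output: lists of filenames with identical contents

# Group the files by size first; os.stat() is much cheaper than hashing.
by_size = defaultdict(list)
for filename in list_of_files_in_directory:
    by_size[os.stat(filename).st_size].append(filename)

for names in by_size.values():
    if len(names) == 1:
        done.append(names)  # a unique size means a unique file
        continue
    # Two or more files share this size, so hash just those files.
    by_hash = defaultdict(list)
    for filename in names:
        by_hash[hash_file(filename)].append(filename)
    done.extend(by_hash.values())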