
This is a component of a class assignment, so I apologize if I cannot go into as much depth as I would like.

To summarize, I need to write a Python function that groups all identical files (meaning files with identical content but different file names). The goal is to eventually build a dictionary of type {string: list}, where each list is a group of identical files and the key (string) is simply the first entry in the group when sorted alphabetically. We are given a directory of files.

So far, I have a program that iterates through each file using glob, and I am using filecmp.cmp(file1, file2) to find identical files. What I am struggling with is the logic required to compare what can be up to 1000 files. I'm sure there is a more Pythonic way to do this than comparing file1 to file2, file1 to file3, and so on.

In conclusion, I know how to iterate through the list of files and I know how to create a dictionary once I have my groups of identical files... I'm just a bit lost on how to efficiently get the groups of files.

Sample: I have 7 files: A, AA, AAA, B, BB, C, D. Files A, AA, and AAA are identical, B and BB are identical, and C and D are unique. My final dictionary should be:

{'A': [A, AA, AAA], 'B': [B, BB], 'C': [C], 'D': [D]}

Thanks in advance for your time!

purdoo

1 Answer


I suggest that you compute a "hash" from the contents of each file. Make a dictionary where the keys are the hash values, and the values are lists of filenames.

The Python hashlib module has multiple hash algorithms you could use. I suggest SHA-1 or MD5.

It is very, very unlikely for two files that are not identical to have the same hash value. If you wanted to make absolutely sure, you could loop over each group of files that share a hash and compare the actual file contents to make sure they really are the same.
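For example, here is a minimal, untested sketch of that extra check (assuming `group` is one of the lists of filenames that ended up sharing a hash), using the filecmp.cmp you already have:

import filecmp

# Hypothetical helper: compare every file in `group` byte-for-byte against the
# first one, to rule out the (extremely unlikely) case of a hash collision.
def verify_group(group):
    first = group[0]
    return all(filecmp.cmp(first, other, shallow=False) for other in group[1:])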

You can use defaultdict to make this even easier: How does collections.defaultdict work?

This is just untested pseudocode, but do something like:

from collections import defaultdict
import hashlib

h = defaultdict(list)

# list_of_files_in_directory: the list of filenames your glob call returns
for filename in list_of_files_in_directory:
    # open in binary mode: hashlib works on bytes, not text
    with open(filename, "rb") as f:
        data = f.read()
    fhash = hashlib.sha1(data).hexdigest()
    h[fhash].append(filename)

# h now maps each unique content hash to the list of filenames with that content

Your dictionary could just use the binary hash data as keys, but it is more convenient to use string values. The .hexdigest() method gives you a string representing the hash as hexadecimal digits.
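From there, turning `h` into the dictionary your assignment asks for (the first filename in alphabetical order as the key) is one more short, untested pass over the groups:

result = {}
for group in h.values():
    group = sorted(group)     # alphabetical order within each group
    result[group[0]] = group  # key is the first filename in sorted order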

EDIT: In a comment, @parchment suggests using os.stat() to get the file size, and only computing the file hash when there are multiple files with the same size. This is an excellent way to speed up the process of finding identical files; if you only have one file that has a particular length, you know it cannot be identical to any of the other files. And if the files are large, computing the hashes might be slow.

But I would suggest writing the simple hashing code first and getting it working; then, if you have time, try rewriting it to check the file size. Code that checks file size and then sometimes also hashes the file will be more complicated and thus more difficult to get right.

Off the top of my head, here is how I would rewrite to use the file sizes:

Make an empty list called done. This is where you will store your output (lists of filenames for which the contents are identical).

Make a dictionary mapping file length to a list of filenames. You can use defaultdict as shown above.

Loop over the dictionary. For each value that is a list containing a single filename, just append that list to done; a unique length means a unique file. For each value that is a list of two or more filenames, compute the hashes and build another dictionary mapping each hash to the list of files with that hash; when that is done, loop over all values in this second dictionary and add them to done. This part is basically the same code as the solution that hashes all files; the difference is that you no longer hash every file, just the files with non-unique lengths.
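Putting those steps together, here is an untested sketch (os.path.getsize is just a convenience wrapper around os.stat, and list_of_files_in_directory is the same placeholder as above):

import os
import hashlib
from collections import defaultdict

# Group filenames by file size first; only hash groups with more than one file.
by_size = defaultdict(list)
for filename in list_of_files_in_directory:
    by_size[os.path.getsize(filename)].append(filename)

done = []
for same_size in by_size.values():
    if len(same_size) == 1:
        # a unique length means a unique file; no hashing needed
        done.append(same_size)
    else:
        by_hash = defaultdict(list)
        for filename in same_size:
            with open(filename, "rb") as f:
                by_hash[hashlib.sha1(f.read()).hexdigest()].append(filename)
        done.extend(by_hash.values())

# done is now a list of lists of filenames with identical contents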

steveha
You could compare file sizes first with a call to stat, then hash the files with the same size, to improve performance. Hashing 1000 files could take a while. – parchment Sep 22 '14 at 22:53