I have two directories containing a bunch of files and subfolders. I would like to check if the file contents are the same in both directories (ignoring the file name). The subfolder structure should be the same too.

I looked at filecmp.dircmp, but it does not help because it does not compare file contents; filecmp.dircmp() has no shallow=False option, see here.

The workaround in this SO answer does not work either, because it considers the file names.

What's the best way to do my comparison?

langlauf.io
  • So you want to compare every file in one dir to every file in another dir to find if there is a possible match? That seems an incredibly long task, and maybe an [xy](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Could you clarify why you want to do this? You basically want the workaround, but allowing for a match between any two pairs of files. – kabanus Feb 11 '19 at 10:39
  • Yes, the workaround looks good, except for the fact that it considers file names (and other os.stat data I suppose). – langlauf.io Feb 11 '19 at 10:48
  • Can you address my other question? If you have two directories with 100 files named differently, it seems like worst case you will be comparing files 10000 times. This seems excessive, especially with big files. – kabanus Feb 11 '19 at 10:50
  • I want to do this because I need to know if two folders have the same structure and contain the same files. If yes, I have a "duplicate" and can delete one of the two. – langlauf.io Feb 11 '19 at 10:51
  • The worst case is not likely if I try to stop the comparison as soon as possible, e.g. by comparing the total size first, comparing the number of files first, etc. – langlauf.io Feb 11 '19 at 10:52
  • So basically, the folder trees should look the same (each folder in each level of the tree contains the same number of files and the same number of sub-directories), but all the names do not matter? – kabanus Feb 11 '19 at 10:53
  • Yes, this is correct. The folder trees should look the same but also the file contents. All names do not matter. – langlauf.io Feb 11 '19 at 10:54

1 Answer

Got around to this. After minor testing this seems to work, though more testing is needed. Again, this can take extremely long, depending on both the number of files and their size:

import filecmp
import os
from collections import defaultdict
from sys import argv

def compareDirs(d1,d2):
    files1 = defaultdict(set)
    files2 = defaultdict(set)
    subd1  = set()
    subd2  = set()
    for entry in os.scandir(d1):
        if entry.is_dir(): subd1.add(entry)
        else: files1[os.path.getsize(entry)].add(entry)
    #Collect both listings first so that counts can be compared before
    #any content, since we are guessing that a mismatch is more likely.
    #Files can be compared directly if this is not true.
    for entry in os.scandir(d2):
        if entry.is_dir(): subd2.add(entry)
        else: files2[os.path.getsize(entry)].add(entry)

    #Cheap structural check before reading content: the number of
    #sub-directories and the number of distinct file sizes must match.
    if len(subd1) != len(subd2) or len(files1) != len(files2): return False

    for size in files2:
        for entry in files2[size]:
            for fname in files1[size]: #An unseen size yields an empty set, so the loop falls through to else
                if filecmp.cmp(fname,entry,shallow=False): break
            else: return False
            files1[size].remove(fname)
            if not files1[size]: del files1[size]

    #Missed a file
    if files1: return False

    #Matching every sd1 is enough since we checked lengths - once all of
    #subd1 are matched, subd2 is fully accounted for.
    for sd1 in subd1:
        for sd2 in subd2:
            if compareDirs(sd1,sd2): break
        else: return False #Did not find a sub-directory
        subd2.remove(sd2)

    return True

print(compareDirs(argv[1],argv[2]))

The function enters both directories recursively. It first compares the files on the current level and fails if any file has no match. It then tries to match each sub-directory of the first directory to some sub-directory of the second, recursively, until all are matched.

This is the most naive solution. A first pass that traverses the tree and matches only sizes and structure would probably help in the average case. That function would look similar, except that it compares getsize results instead of calling filecmp, and it would save the matching tree structures so that the second, content-comparing pass runs faster.
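
A rough sketch of such a size-and-structure pass follows. The name size_signature is hypothetical and not part of the code above; it assumes symlinks can be followed and that every entry is readable. Two trees can only be duplicates if their signatures are equal, so differing signatures allow an early exit without reading any file content.

```python
import os

def size_signature(path):
    """Summarise a tree by file sizes and nesting only, ignoring all names."""
    sizes = []
    children = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():
                # Recurse: a sub-directory is represented by its own signature.
                children.append(size_signature(entry.path))
            else:
                sizes.append(entry.stat().st_size)
    # Sorting removes any dependence on names or on listing order.
    return (tuple(sorted(sizes)), tuple(sorted(children)))
```

Equal signatures do not prove the contents are equal, so a content comparison such as filecmp.cmp(..., shallow=False) is still needed afterwards.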

Of course, in case of a few sub-directories with the exact same structures and sizes we would still need to compare all possibilities of matching.
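
One way to sidestep trying every pairing, which is a different technique from the code above, is to reduce each tree to a canonical digest built from content hashes. The sketch below is only an illustration under assumptions: file_digest and tree_digest are hypothetical names, SHA-256 is an arbitrary choice, and equal digests indicate equal trees only up to hash collisions.

```python
import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    """Hash a file's content in chunks so large files are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()

def tree_digest(path):
    """Name-independent digest of a directory: depends on contents and nesting only."""
    file_hashes = []
    dir_hashes = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():
                dir_hashes.append(tree_digest(entry.path))
            else:
                file_hashes.append(file_digest(entry.path))
    h = hashlib.sha256()
    # Sorted order makes the result independent of names and listing order.
    # The F/D tags keep an empty file distinct from an empty sub-directory.
    for d in sorted(file_hashes):
        h.update(b"F" + d)
    for d in sorted(dir_hashes):
        h.update(b"D" + d)
    return h.digest()
```

With this, two directories are candidates for being duplicates when tree_digest(d1) == tree_digest(d2). Every file is read exactly once, at the cost of always reading everything, so a cheap size check beforehand may still pay off.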

kabanus