I am trying to compare our current project media server (dir1) with a backup (dir2) to see which documents were deleted. Both are Windows directories. Many of the files have been moved into new subdirectories but are not missing. Because the directory structure has changed, recursing with filecmp.dircmp as described in this post won't work: Recursively compare two directories to ensure they have the same files and subdirectories

The other consideration is that different files can have the same file name, so the comparison needs to look at file size, modification date, etc. to determine whether two files are the same.

What I want, in pseudocode:

def find_missing_files(currentDir, backup):
    <does stuff>
    return <List of Files in backup that are not in currentDir>

What I have:

def build_file_list(someDir, fileList = []):
    for root, dirs, files in os.walk(someDir):
        if files:
            for file in files:
                filePath = os.path.join(root, file)
                if filePath not in fileList:
                    fileList.append(filePath)
    return fileList

def cmp_file_lists(dir1, dir2):
    dir1List = build_file_list(dir1)
    dir2List = build_file_list(dir2)

    for dir2file in dir2List:
        for dir1file in dir1List:
            if filecmp.cmp(dir1file, dir2file):
                dir1List.remove(dir1file)
                dir2List.remove(dir2file)
                break
    return (dir1List, dir2List)

EDIT: In the above code, dir2List.remove(dir2file) throws an error saying that dir2file is not in dir2List because (it appears) dir2List and dir1List are somehow the same object. I don't know how that is happening.
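The shared-object symptom is consistent with Python's mutable default argument behavior: `fileList = []` is evaluated once, when the function is defined, so every call that omits the argument appends to and returns the same list. The sketch below shows one way the two functions could be written to avoid that. The snake_case names, the `for ... else` bookkeeping, and `shallow=False` (which forces a byte-by-byte comparison) are assumptions made for illustration, not part of the original code.

```python
import filecmp
import os


def build_file_list(some_dir, file_list=None):
    # A default of [] would be created once and shared by every call;
    # defaulting to None and creating the list here gives each call its own.
    if file_list is None:
        file_list = []
    for root, _dirs, files in os.walk(some_dir):
        for name in files:
            file_list.append(os.path.join(root, name))
    return file_list


def cmp_file_lists(dir1, dir2):
    unmatched1 = build_file_list(dir1)
    unmatched2 = []
    for file2 in build_file_list(dir2):
        for file1 in unmatched1:
            # shallow=False compares contents (an assumption; the original
            # relied on the default shallow comparison).
            if filecmp.cmp(file1, file2, shallow=False):
                # Removing is safe here because we stop iterating right after.
                unmatched1.remove(file1)
                break
        else:
            # No file in dir1 matched, so file2 exists only in dir2.
            unmatched2.append(file2)
    return unmatched1, unmatched2
```

Building `unmatched2` as a new list, instead of removing items from the list being looped over, also avoids skipping elements during iteration. Note that this still compares every pair of files, so it may be slow on large trees.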

I don't know whether this could be done more easily with filecmp.dircmp and I am just missing it, or whether this is the best approach for what I want. Or should I take each file from dir2 and use os.walk to look for it in dir1?

1 Answer

May I suggest an alternative? Using pathlib and its rglob method, everything is much easier (if you really are agnostic about subdirectories):

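from pathlib import Path

def cmp_file_lists(dir1, dir2):
    # Collect bare file names from each tree; rglob('*') also yields
    # directories, so those are filtered out.
    dir1_filenames = {f.name for f in Path(dir1).rglob('*') if f.is_file()}
    dir2_filenames = {f.name for f in Path(dir2).rglob('*') if f.is_file()}
    files_in_dir1_but_not_dir2 = dir1_filenames - dir2_filenames
    files_in_dir2_but_not_dir1 = dir2_filenames - dir1_filenames
    # Return the two differences, not the full name sets
    return files_in_dir1_but_not_dir2, files_in_dir2_but_not_dir1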
Ofer Sadan
  • Experimenting with this... but does it not just compare filenames? I will have different files with the same filename (in different directories, obviously), hence the reason to use filecmp.cmp, which looks at other stats to compare... I think. Also, it should return files_in_dir2_but_not_dir1 and files_in_dir1_but_not_dir2, no? – constdoc constdoc Oct 23 '19 at 20:44
  • @constdocconstdoc it will compare only names, that's true, but you didn't specify what `filecmp.cmp` does. You can always add that level of comparison to this code, or alternatively, compare the files with more than just `name` in the set. The basic idea here is to show you an alternative in searching through a directory, what you do with that is up to you – Ofer Sadan Oct 23 '19 at 20:47
  • I appreciate that you have highlighted a method for building the list of files, but comparing the files was the question here. Nonetheless, this may help build the lists, if I do end up using lists of files. – constdoc constdoc Oct 23 '19 at 21:24
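One way to combine the two ideas from this thread, the rglob-based listing and the "more than just name in the set" comparison, is to key each file by a small signature instead of its name alone. The following is only a sketch: `file_signature` is a hypothetical helper, and treating (name, size, whole-second modification time) as "the same file" is an assumption that holds only if the backup preserved modification times.

```python
from pathlib import Path


def file_signature(path):
    # Hypothetical identity key: name, size in bytes, and mtime truncated
    # to whole seconds (filesystems differ in timestamp precision).
    st = path.stat()
    return (path.name, st.st_size, int(st.st_mtime))


def find_missing_files(current_dir, backup):
    # Signatures of every file under current_dir, wherever it now lives.
    current = {
        file_signature(p)
        for p in Path(current_dir).rglob('*')
        if p.is_file()
    }
    # Files in the backup whose signature is absent from current_dir.
    return [
        p
        for p in Path(backup).rglob('*')
        if p.is_file() and file_signature(p) not in current
    ]
```

Calling `find_missing_files(dir1, dir2)` would return the backup paths with no signature match anywhere under dir1. Because the lookup is a set membership test, this avoids comparing every pair of files; if timestamps are not reliable, a content hash could replace the mtime field at the cost of reading each file.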