44

From what I observe, filecmp.dircmp is recursive, but inadequate for my needs, at least in py2. I want to compare two directories and all their contained files. Does this exist, or do I need to build it (using os.walk, for example)? I prefer pre-built, where someone else has already done the unit-testing :)

The actual 'comparison' can be sloppy (ignore permissions, for example), if that helps.

I would like something boolean, and report_full_closure is a printed report. It also only goes down common subdirs. AFIAC, if anything exists only in the left or only in the right dir, those are different dirs. I built this using os.walk instead.

jww
Gregg Lind
  • What is "AFIAC"? Can't see it in any common acronym when searching. – LondonAppDev Oct 04 '20 at 16:09
  • @LondonAppDev my guess is that the author meant AFAIC (as far as i'm concerned (e.g. https://www.oxfordlearnersdictionaries.com/us/definition/english/afaic#:~:text=abbreviation%20(informal),messages%2C%20social%20media%2C%20etc.)) but possibly misspelled it as AFIAC. Closely related to AFAIK, i.e. as far as i know. – Bill Sep 26 '21 at 17:59

15 Answers

36

Here's an alternative implementation of the comparison function using the filecmp module. It uses recursion instead of os.walk, so it is a little simpler. However, it does not recurse simply through the common_dirs and subdirs attributes, since that would implicitly use the default "shallow" file comparison (files whose os.stat signatures match are treated as equal without being read), which is probably not what you want. In the implementation below, files with the same name are always compared by content.

import filecmp
import os.path

def are_dir_trees_equal(dir1, dir2):
    """
    Compare two directories recursively. Files in each directory are
    assumed to be equal if their names and contents are equal.

    @param dir1: First directory path
    @param dir2: Second directory path

    @return: True if the directory trees are the same and 
        there were no errors while accessing the directories or files, 
        False otherwise.
   """

    dirs_cmp = filecmp.dircmp(dir1, dir2)
    if len(dirs_cmp.left_only)>0 or len(dirs_cmp.right_only)>0 or \
        len(dirs_cmp.funny_files)>0:
        return False
    (_, mismatch, errors) =  filecmp.cmpfiles(
        dir1, dir2, dirs_cmp.common_files, shallow=False)
    if len(mismatch)>0 or len(errors)>0:
        return False
    for common_dir in dirs_cmp.common_dirs:
        new_dir1 = os.path.join(dir1, common_dir)
        new_dir2 = os.path.join(dir2, common_dir)
        if not are_dir_trees_equal(new_dir1, new_dir2):
            return False
    return True
Mateusz Kobos
  • The second `filecmp.cmpfiles` with `shallow=False` is not necessary. One can get `dirs_cmp.diff_files` from the first dircmp. A common misunderstanding (one that we made as well!) is that dir_cmp is shallow only and doesn't compare file contents! Turns out that is not true! The meaning of shallow=True is only to save time, and does not actually consider two files with differing last modification times to be different. If the last mod time is different, it moves into reading each file's contents and comparing. If contents are identical, then it's a match even if last mod is different. – AdamE Nov 16 '21 at 00:52
  • `import os` should be enough – mountrix Jan 29 '22 at 20:54
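
To make the `shallow` behaviour discussed in these comments concrete, here is a minimal sketch. The helper name `shallow_vs_deep` and the file names are made up for illustration. It writes two files with the same size and the same modification time but different bytes, then compares them both ways with `filecmp.cmp`.

```python
import filecmp
import os
import tempfile


def shallow_vs_deep(content_a=b"aaaa", content_b=b"bbbb"):
    """Return (shallow_result, deep_result) for two same-sized files
    that are forced to share one modification time."""
    with tempfile.TemporaryDirectory() as tmp:
        path_a = os.path.join(tmp, "a.bin")
        path_b = os.path.join(tmp, "b.bin")
        for path, data in ((path_a, content_a), (path_b, content_b)):
            with open(path, "wb") as fh:
                fh.write(data)
            # Same access/modification time for both files, so their
            # os.stat signatures (type, size, mtime) are identical.
            os.utime(path, (1_000_000_000, 1_000_000_000))
        shallow = filecmp.cmp(path_a, path_b, shallow=True)
        deep = filecmp.cmp(path_a, path_b, shallow=False)
    return shallow, deep


print(shallow_vs_deep())  # → (True, False)
```

So a shallow comparison does read contents when the signatures differ, but it accepts files with matching signatures without opening them. Whether that matters depends on how likely same-size, same-mtime files with different content are in your trees.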
25

filecmp.dircmp is the way to go. But by default it does not necessarily compare the content of files found at the same path in the two directories: files whose os.stat signatures (type, size, modification time) match are treated as equal without being read. Since dircmp is a class, you can fix that by subclassing dircmp and overriding its phase3 method, which compares the common files, so that content is always compared.

import filecmp

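class dircmp(filecmp.dircmp):
    """
    Compare the content of dir1 and dir2. In contrast with filecmp.dircmp, this
    subclass compares the content of files with the same path.
    """
    def phase3(self):
        """
        Find out differences between common files.
        Ensure we are using content comparison with shallow=False.
        """
        fcomp = filecmp.cmpfiles(self.left, self.right, self.common_files,
                                 shallow=False)
        self.same_files, self.diff_files, self.funny_files = fcomp

    # methodmap in the base class refers to the base-class phase3, so remap
    # the three attributes to the override (see the comments below).
    methodmap = dict(filecmp.dircmp.methodmap, same_files=phase3,
                     diff_files=phase3, funny_files=phase3)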

Then you can use this to return a boolean:

import os.path

def is_same(dir1, dir2):
    """
    Compare two directory trees content.
    Return False if they differ, True if they are the same.
    """
    compared = dircmp(dir1, dir2)
    if (compared.left_only or compared.right_only or compared.diff_files 
        or compared.funny_files):
        return False
    for subdir in compared.common_dirs:
        if not is_same(os.path.join(dir1, subdir), os.path.join(dir2, subdir)):
            return False
    return True

In case you want to reuse this code snippet, it is hereby dedicated to the Public Domain or the Creative Commons CC0 at your choice (in addition to the default license CC-BY-SA provided by SO).

Philippe Ombredanne
  • 1
    FWIW, this code snippet is released to the public domain or under CC0 1.0 at your choice. – Philippe Ombredanne Dec 29 '15 at 10:10
  • 1
    Note: To have this work recursively, you will also have to: (1) override the methodmap attribute (2) override phase4 so that subdir will yield instances of your class. – Vidar Jul 26 '17 at 13:34
  • @Vidar are you sure? Do you have an example that would not work? Recursion is handled in `is_same` here. – Philippe Ombredanne Jul 26 '17 at 14:26
  • True, you side step the issue of recursion by `subdirs` here, but the class variable `methodmap` still needs to be overridden, otherwise your custom `phase3` will never be called. UPDATE: At least for Python 3.5.3. – Vidar Jul 31 '17 at 13:30
  • @Vidar Thanks! good insight... I only tested this with Python 2.x – Philippe Ombredanne Aug 01 '17 at 11:43
  • 1
    I opened https://github.com/python/cpython/pull/5088 with some changes to make it easier to subclass. – Mitar Jan 03 '18 at 07:34
  • Why isn't there a `shallow` argument for `filecmp.dircmp` as there is one for `filecmp.cmp()`? – langlauf.io Feb 11 '19 at 10:19
  • @langlauf.io There is an open bug for adding `shallow` kwarg to `dircmp` at https://bugs.python.org/issue12932. Looks like it's just gone stale... – Nick Crews Nov 19 '20 at 21:56
  • 1
    @Vidar Subclassing dircmp now works as expected after https://github.com/python/cpython/pull/23424, should be released in python 3.10 – Nick Crews Nov 23 '20 at 18:52
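
A minimal sketch of what Vidar describes, assuming you want recursion to go through `subdirs` rather than through a separate helper. The names `DeepDircmp` and `trees_match` are hypothetical. The sketch overrides `phase3` and `phase4` and remaps `methodmap`, so it does not rely on any particular Python 3 version for subclass support. It also checks `common_funny`, which catches names that are a file on one side and a directory on the other.

```python
import filecmp
import os


class DeepDircmp(filecmp.dircmp):
    """Hypothetical subclass: content comparison at every level."""

    def phase3(self):
        # Read file contents instead of trusting os.stat signatures.
        self.same_files, self.diff_files, self.funny_files = filecmp.cmpfiles(
            self.left, self.right, self.common_files, shallow=False)

    def phase4(self):
        # Build the sub-comparisons with this class so they use phase3 above.
        self.subdirs = {
            name: type(self)(os.path.join(self.left, name),
                             os.path.join(self.right, name),
                             self.ignore, self.hide)
            for name in self.common_dirs
        }

    # methodmap stores plain functions, so point it at the overrides.
    methodmap = dict(filecmp.dircmp.methodmap,
                     same_files=phase3, diff_files=phase3, funny_files=phase3,
                     subdirs=phase4)


def trees_match(dcmp):
    """Return True if the compared trees have the same names and contents."""
    if (dcmp.left_only or dcmp.right_only or dcmp.common_funny
            or dcmp.diff_files or dcmp.funny_files):
        return False
    return all(trees_match(sub) for sub in dcmp.subdirs.values())
```

Usage would look like `trees_match(DeepDircmp(dir1, dir2))`.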
9

Here is a simple solution with a recursive function:

import filecmp

def same_folders(dcmp):
    if dcmp.diff_files or dcmp.left_only or dcmp.right_only:
        return False
    for sub_dcmp in dcmp.subdirs.values():
        if not same_folders(sub_dcmp):
            return False
    return True

same_folders(filecmp.dircmp('/tmp/archive1', '/tmp/archive2'))
mike rodent
Guillaume Vincent
  • 1
    Nice ... but are you aware that `diff_files` doesn't tell the whole story (just files with the same path, but which have been modified)? You also have to check for `dircmp.left_only` and `dircmp.right_only`, respectively new and deleted files in the source. – mike rodent Feb 25 '21 at 20:54
6

The report_full_closure() method is recursive:

comparison = filecmp.dircmp('/directory1', '/directory2')
comparison.report_full_closure()

Edit: After the OP's edit, I would say that it's best to just use the other functions in filecmp. I think os.walk is unnecessary; better to simply recurse through the lists produced by common_dirs, etc., although in some cases (large directory trees) this might risk a Max Recursion Depth error if implemented poorly.
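
If recursion depth is a concern, one option is to keep the pending comparisons on an explicit list instead of the call stack. This is only a sketch, and the function name `same_tree_iterative` is made up here. It uses the default (shallow) file comparison of `filecmp.dircmp`.

```python
import filecmp


def same_tree_iterative(dir1, dir2):
    """Return True if both trees hold the same names and dircmp sees no
    differing files. Uses a work list instead of recursive calls."""
    pending = [filecmp.dircmp(dir1, dir2)]
    while pending:
        dcmp = pending.pop()
        if (dcmp.left_only or dcmp.right_only
                or dcmp.diff_files or dcmp.funny_files):
            return False  # stop at the first difference found
        # Queue the comparisons for the subdirectories common to both sides.
        pending.extend(dcmp.subdirs.values())
    return True
```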

asthasr
3

Another solution: compare the layout of dir1 and dir2, ignoring the content of files.

See gist here: https://gist.github.com/4164344

Edit: here's the code, in case the gist gets lost for some reason:

import os

def compare_dir_layout(dir1, dir2):
    def _compare_dir_layout(dir1, dir2):
        # Report every file under dir1 that has no counterpart under dir2.
        for (dirpath, dirnames, filenames) in os.walk(dir1):
            relative_path = os.path.relpath(dirpath, dir1)
            for filename in filenames:
                if not os.path.exists(os.path.join(dir2, relative_path, filename)):
                    print(relative_path, filename)

    print('files in "' + dir1 + '" but not in "' + dir2 + '"')
    _compare_dir_layout(dir1, dir2)
    print('files in "' + dir2 + '" but not in "' + dir1 + '"')
    _compare_dir_layout(dir2, dir1)


compare_dir_layout('xxx', 'yyy')
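
Since the question asks for a boolean, a layout-only check can also be sketched with sets of relative paths. The names `relative_entries` and `same_layout` are hypothetical, and like the code above this ignores file contents entirely.

```python
import os


def relative_entries(root):
    """Return a set of (relative_path, kind) pairs for everything under root,
    where kind is "dir" or "file"."""
    entries = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.relpath(os.path.join(dirpath, name), root)
            entries.add((path, "dir"))
        for name in filenames:
            path = os.path.relpath(os.path.join(dirpath, name), root)
            entries.add((path, "file"))
    return entries


def same_layout(dir1, dir2):
    """Return True if both trees contain the same relative paths."""
    return relative_entries(dir1) == relative_entries(dir2)
```

Recording the kind alongside the path means an empty subdirectory on one side, or a name that is a file on one side and a directory on the other, counts as a difference.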
Clare Macrae
Raullen Chai
3

dircmp can be recursive: see report_full_closure.

As far as I know, dircmp does not offer a ready-made function that returns a single True/False for two whole trees. It would be very easy to write your own, though; use left_only and right_only on dircmp to check that the files in the directories are the same and then recurse on the subdirs attribute.

Katriel
3

This recursive function seems to work for me:

def has_differences(dcmp):
    differences = dcmp.left_only + dcmp.right_only + dcmp.diff_files
    if differences:
        return True
    return any([has_differences(subdcmp) for subdcmp in dcmp.subdirs.values()])

Assuming I haven't overlooked anything, you could just negate the result if you wanna know if directories are the same:

from filecmp import dircmp

comparison = dircmp("dir1", "dir2")
same = not has_differences(comparison)
oats
2

Since a True or False result is all you want, if you have diff installed:

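import subprocess

def are_dir_trees_equal(dir1, dir2):
    # diff exits with 0 when the trees match; discard its output so a large
    # report cannot fill a pipe and block the child process.
    exit_code = subprocess.call(["diff", "-r", dir1, dir2],
                                stdout=subprocess.DEVNULL)
    return not exit_code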
Brent
  • One problem I have with this is that, unfortunately, there is no way for you to say "stop when you have found a difference". There should be! With large directories `diff` can take huge amounts of time to complete, and in this use case, unnecessarily. – mike rodent Feb 25 '21 at 20:51
1

Based on python issue 12932 and the filecmp documentation, you may use the following example:

import os
import filecmp

# force content compare instead of os.stat attributes only comparison
filecmp.cmpfiles.__defaults__ = (False,)

def _is_same_helper(dircmp):
    # Any one-sided entry, differing file or uncomparable file means "not same".
    if dircmp.left_only or dircmp.right_only or dircmp.diff_files or dircmp.funny_files:
        return False
    for sub_dircmp in dircmp.subdirs.values():
        if not _is_same_helper(sub_dircmp):
            return False
    return True

def is_same(dir1, dir2):
    """
    Recursively compare two directories
    :param dir1: path to first directory 
    :param dir2: path to second directory
    :return: True in case directories are the same, False otherwise
    """
    if not os.path.isdir(dir1) or not os.path.isdir(dir2):
        return False
    dircmp = filecmp.dircmp(dir1, dir2)
    return _is_same_helper(dircmp)
alzix
0
import filecmp
import os

def same(dir1, dir2):
    """Returns True if recursively identical, False otherwise

    """
    c = filecmp.dircmp(dir1, dir2)
    if c.left_only or c.right_only or c.diff_files or c.funny_files:
        return False
    else:
        same_so_far = True
        for i in c.common_dirs:
            same_so_far = same_so_far and same(os.path.join(dir1, i), os.path.join(dir2, i))
            if not same_so_far:
                break
        return same_so_far
NotAUser
0

Here is my solution: gist

import filecmp
import os

def dirs_same_enough(dir1,dir2,report=False):
    ''' use os.walk and filecmp.cmpfiles to
    determine if two dirs are 'same enough'.

    Args:
        dir1, dir2:  two directory paths
        report:  if True, print the filecmp.dircmp(dir1,dir2).report_full_closure()
                 before returning

    Returns:
        bool

    '''
    # os walk:  root, list(dirs), list(files)
    # those lists won't have consistent ordering,
    # os.walk also has no guaranteed ordering, so have to sort.
    walk1 = sorted(list(os.walk(dir1)))
    walk2 = sorted(list(os.walk(dir2)))

    def report_and_exit(report,bool_):
        if report:
            filecmp.dircmp(dir1,dir2).report_full_closure()
        return bool_

    if len(walk1) != len(walk2):
        return report_and_exit(report,False)

    for (p1,d1,fl1),(p2,d2,fl2) in zip(walk1,walk2):
        d1,fl1, d2, fl2 = set(d1),set(fl1),set(d2),set(fl2)
        if d1 != d2 or fl1 != fl2:
            return report_and_exit(report,False)
        # one cmpfiles call already covers every file in this directory
        same,diff,weird = filecmp.cmpfiles(p1,p2,fl1,shallow=False)
        if diff or weird:
            return report_and_exit(report,False)

    return report_and_exit(report,True)
Gregg Lind
0

This will check if files are in the same locations and if their content is the same. It will not correctly validate for empty subfolders.

import filecmp
import glob
import os

path_1 = '.'
path_2 = '.'

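def folders_equal(f1, f2):
    # Sorted relative paths of every regular file under each folder, so the
    # two listings line up regardless of the order glob returns them in.
    files_1 = sorted(os.path.relpath(x, f1) for x in glob.iglob(os.path.join(f1, '**'), recursive=True) if os.path.isfile(x))
    files_2 = sorted(os.path.relpath(x, f2) for x in glob.iglob(os.path.join(f2, '**'), recursive=True) if os.path.isfile(x))

    # Every file must exist at the same relative location on both sides.
    locations_equal = files_1 == files_2
    files_equal = locations_equal and all(
        filecmp.cmp(os.path.join(f1, x), os.path.join(f2, x)) for x in files_1)

    return locations_equal and files_equal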

folders_equal(path_1, path_2)
Rok
0

To anyone looking for a simple library:

https://github.com/mitar/python-deep-dircmp

DeepDirCmp basically subclasses filecmp.dircmp and shows output identical to diff -qr dir1 dir2.

Usage:

from deep_dircmp import DeepDirCmp

cmp = DeepDirCmp(dir1, dir2)
if len(cmp.get_diff_files_recursive()) == 0:
    print("Dirs match")
else:
    print("Dirs don't match")
Gh0sT
0

Based on @Mateusz Kobos' currently accepted answer: it turns out that the second filecmp.cmpfiles call with shallow=False is not necessary, so we've removed it. One can get dirs_cmp.diff_files from the first dircmp. A common misunderstanding (one that we made as well!) is that dircmp is shallow only and never compares file contents. That is not true. shallow=True only saves time: two files with differing last modification times are not automatically considered different. If the last modified time differs between two files, their contents are read and compared, and if the contents are identical it's a match even though the modification dates differ. (When size and last modified date are both equal, the files are assumed equal without being read.) We've added verbose prints here for clarity. See elsewhere (filecmp.cmp() ignoring differing os.stat() signatures?) if you want differences in st_mtime to count as a mismatch. We also switched to the newer pathlib instead of the os library.

import filecmp
from pathlib import Path

def compare_directories_recursive(dir1:Path, dir2:Path,verbose=True):
    """
    Compares two directories recursively. 
    First, file counts in each directory are compared. 
    Second, files are assumed to be equal if their names, size and last modified date are equal (aka shallow=True in python terms)
    If last modified date is different, then the contents are compared by reading each file. 
    Caveat: if the contents are equal and last modified is NOT equal, files are still considered equal! 
    This caveat is the default python filecmp behavior as unintuitive as it may seem.

    @param dir1: First directory path
    @param dir2: Second directory path
    """

    dirs_cmp = filecmp.dircmp(str(dir1), str(dir2))
    if len(dirs_cmp.left_only)>0:
        if verbose:
            print(f"Should not be any more files in original than in destination left_only: {dirs_cmp.left_only}")
        return False
    if len(dirs_cmp.right_only)>0:
        if verbose:
            print(f"Should not be any more files in destination than in original right_only: {dirs_cmp.right_only}")
        return False
    if len(dirs_cmp.funny_files)>0:
        if verbose:
            print(f"There should not be any funny files between original and destination. These file(s) are funny {dirs_cmp.funny_files}")
        return False
    if len(dirs_cmp.diff_files)>0:
        if verbose:
            print(f"There should not be any different files between original and destination. These file(s) are different {dirs_cmp.diff_files}")
        return False

    for common_dir in dirs_cmp.common_dirs:
        new_dir1 = Path(dir1).joinpath(common_dir)
        new_dir2 = Path(dir2).joinpath(common_dir)
        # pass verbose along so nested calls honor the caller's choice
        if not compare_directories_recursive(new_dir1, new_dir2, verbose):
            return False
    return True
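
If you do want a differing modification time to count as a mismatch, one possible approach is to check the timestamps yourself before comparing contents. This is a minimal sketch, not the approach from the linked question, and the helper name `same_content_and_mtime` is made up here.

```python
import filecmp
import os


def same_content_and_mtime(path1, path2):
    """Return True only if both files share a modification time
    and have identical contents."""
    stat1, stat2 = os.stat(path1), os.stat(path2)
    if stat1.st_mtime_ns != stat2.st_mtime_ns:
        return False  # treat any timestamp difference as a mismatch
    # shallow=False forces a byte-by-byte comparison of the contents.
    return filecmp.cmp(path1, path2, shallow=False)
```

This could be applied to each name in `dirs_cmp.same_files` when a stricter notion of equality is needed.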
AdamE
0

Here's a tiny hack without our own recursion and algorithm:

import contextlib
import filecmp
import io
import re

def are_dirs_equal(a, b) -> bool:
    stdout = io.StringIO()
    with contextlib.redirect_stdout(stdout):
        filecmp.dircmp(a, b).report_full_closure()
    return re.search("Differing files|Only in", stdout.getvalue()) is None
Nelson Yeung