5

I am using os.walk to compare two folders, and see if they contain the exact same files. However, this only checks the file names. I want to ensure the file sizes are the same, and if they're different report back. Can you get the file size from os.walk?

Michael Berkowski
shoes
  • Note that file size equality doesn't guarantee that the files are the same; you may want to use the difflib module or compute a checksum. (Alas, the Python site isn't responding for me at the moment, so I can't provide URLs.) That said, for file sizes see this previous question: http://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python – GreenMatt Jul 21 '11 at 13:16

4 Answers

11

The same way you get file size without using os.walk, with os.stat. You just need to remember to join with the root:

import os

for root, dirs, files in os.walk(some_directory):
    for fn in files:
        path = os.path.join(root, fn)
        size = os.stat(path).st_size  # in bytes

        # ...
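
To connect this back to the question, here is a minimal sketch of comparing two folders by file name and size using the loop above. The helper names `collect_sizes` and `compare_sizes` are made up for illustration, and the sketch assumes both trees are readable and contain no broken symlinks (os.stat would raise on those).

```python
import os

def collect_sizes(top):
    """Map each file's path relative to `top` to its size in bytes."""
    sizes = {}
    for root, dirs, files in os.walk(top):
        for fn in files:
            path = os.path.join(root, fn)
            # Key by relative path so the two trees can be compared directly.
            sizes[os.path.relpath(path, top)] = os.stat(path).st_size
    return sizes

def compare_sizes(left, right):
    """Return (only_left, only_right, size_mismatch) for two directory trees."""
    a, b = collect_sizes(left), collect_sizes(right)
    only_left = sorted(a.keys() - b.keys())
    only_right = sorted(b.keys() - a.keys())
    # Files present in both trees whose sizes differ.
    mismatch = sorted(p for p in a.keys() & b.keys() if a[p] != b[p])
    return only_left, only_right, mismatch
```

Calling `compare_sizes("folder_a", "folder_b")` (placeholder paths) returns three sorted lists of relative paths that you can print or log. Equal sizes still do not prove equal contents, so treat an empty mismatch list as "no differences found by size" only.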
Cat Plus Plus
2

os.path.getsize(path) gives you the size of a file, but two files having the same size does not mean they are identical. You could read the contents of each file and compare an MD5 or other hash of them.
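
A rough sketch of that idea, using the standard hashlib module: check the sizes first, and hash the contents only when the sizes match. The function names `file_digest` and `same_content` are hypothetical, and SHA-256 is just one reasonable choice of algorithm, not something the answer prescribes.

```python
import hashlib
import os

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a, path_b):
    """Cheap size check first; hash the contents only when the sizes match."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_digest(path_a) == file_digest(path_b)
```

The size check avoids reading either file in the common case where the sizes already differ.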

Meitham
  • 1
    File size not equal is a pretty good guarantee that the files are not identical, however. – Vatine Jul 21 '11 at 13:18
  • 1
    If you're not worried about people intentionally faking that the file is the same, there are checksum algorithms much faster than MD5. Looking at the last modified time of the file is also a good way to confirm identically sized files are the same if you don't want to have to open the file. – agf Jul 21 '11 at 13:48
1

FYI, there is a more efficient solution in Python 3, using os.scandir (added in Python 3.5; using it in a with statement requires 3.6 or later):

import os

with os.scandir(rootdir) as it:
    for entry in it:
        if entry.is_file():
            filepath = entry.path  # rootdir joined with the name; absolute only if rootdir is absolute
            filesize = entry.stat().st_size

See os.DirEntry for more details about the variable entry.

Note that the above is not recursive (subfolders will not be explored). In order to get an os.walk-like behaviour, you might want to use the following:

import os
from collections import namedtuple
from os.path import normpath, realpath
from os.path import join as pathjoin

_wrap_entry = namedtuple('DirEntryWrapper', 'name path islink size')

def scantree(rootdir, follow_links=False, reldir='', visited=None):
    # Share one set across all recursive calls; a fresh set per call
    # would not detect symlink cycles that span directory levels.
    if visited is None:
        visited = {realpath(rootdir)}
    rootdir = normpath(rootdir)
    with os.scandir(rootdir) as it:
        for entry in it:
            if entry.is_dir():
                if not entry.is_symlink() or follow_links:
                    absdir = realpath(entry.path)
                    if absdir in visited:
                        continue
                    visited.add(absdir)
                    yield from scantree(entry.path, follow_links,
                                        pathjoin(reldir, entry.name), visited)
            else:
                yield _wrap_entry(
                    pathjoin(reldir, entry.name),  # path relative to the top-level rootdir
                    entry.path,
                    entry.is_symlink(),
                    entry.stat().st_size)

and use it as

for entry in scantree(rootdir, follow_links=False):
    filepath = entry.path 
    filesize = entry.size
Jonathan H
1

As others have said, you can get the size with os.stat. However, for comparing directories you can use filecmp.dircmp.
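
As a minimal sketch of how that could look, the hypothetical helper `diff_trees` below walks two trees recursively with filecmp.dircmp and collects relative paths. Note that dircmp compares files shallowly by default (by their os.stat signature, falling back to contents when the signatures differ) and skips a few names such as CVS and RCS directories unless told otherwise.

```python
import filecmp
import os

def diff_trees(left, right, prefix=""):
    """Recursively collect paths that differ between two directory trees."""
    cmp = filecmp.dircmp(left, right)
    result = {
        "left_only": [os.path.join(prefix, n) for n in cmp.left_only],
        "right_only": [os.path.join(prefix, n) for n in cmp.right_only],
        "different": [os.path.join(prefix, n) for n in cmp.diff_files],
    }
    # Descend into directories that exist on both sides.
    for name in cmp.common_dirs:
        sub = diff_trees(os.path.join(left, name),
                         os.path.join(right, name),
                         os.path.join(prefix, name))
        for key in result:
            result[key].extend(sub[key])
    return result
```

Files with different sizes always end up under "different", which covers the size check the question asks for.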

Douglas Leeder