29

I'm using this code to calculate the hash value for a file:

import hashlib

m = hashlib.md5()
with open("calculator.pdf", 'rb') as fh:
    while True:
        data = fh.read(8192)
        if not data:
            break
        m.update(data)
    hash_value = m.hexdigest()

    print hash_value

When I tried it on a folder named "folder", I got:

IOError: [Errno 13] Permission denied: folder

How can I calculate the hash value for a folder?

funky-future
user3832061
  • For what purpose? Unique identification? Use the full folder path or inode. To identify its contents? Then iterate through its full contents and hash that. – Konrad Rudolph Jul 24 '14 at 15:12
  • You must calculate the hash value of all its files and the files of its subfolders. – Jul 24 '14 at 15:13
  • Konrad's correct in that there's lots of ambiguity in the question. Another possibility he hasn't listed is hashing the directory entry meta-data, which could be used for a quick/rough check whether content has changed. BTW, some OSes do let you "open" a directory much as if it was a text file and the code above for files would already "work" for whatever metadata the directory "file" stream produced. As is, the question deserves to be closed unless the need or aim is clarified. – Tony Delroy Jul 25 '14 at 05:44
  • There's [this gist](https://gist.github.com/techtonik/5175896) with, IMO, cleaner code. There are also a dedicated package, [checksumdir](https://pypi.python.org/pypi/checksumdir), and [dirtools](https://pypi.python.org/pypi/dirtools), which has hashing capabilities. – funky-future Jul 25 '15 at 20:18

7 Answers

23

Use the checksumdir Python package to calculate the checksum/hash of a directory. It's available at https://pypi.python.org/pypi/checksumdir

Usage:

import checksumdir

directory_hash = checksumdir.dirhash("c:\\temp")
print(directory_hash)
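
The hashing algorithm can also be chosen via dirhash's second argument (MD5 is the default); for example, to get a SHA-1 digest instead:

sha1hash = checksumdir.dirhash("c:\\temp", "sha1")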
Mifeet
Mangu Singh Rajpurohit
  • Note: `checksumdir` is not tested (even though it claims to be stable). Using it is (arguably) *less* trustworthy than using the recipe, which at least forces you to read it. – Jorge Leitao Dec 16 '18 at 14:51
  • It doesn't seem to work for me across different operating systems (I get different hashes under macOS and Debian) :( – Seub Apr 30 '21 at 10:19
20

Here is an implementation that uses pathlib.Path instead of relying on os.walk. It sorts the directory contents before iterating so it should be repeatable on multiple platforms. It also updates the hash with the names of files/directories, so adding empty files and directories will change the hash.

Version with type annotations (Python 3.6 or above):

import hashlib
from _hashlib import HASH as Hash  # CPython-internal hash type, used here only for annotations
from pathlib import Path
from typing import Union


def md5_update_from_file(filename: Union[str, Path], hash: Hash) -> Hash:
    assert Path(filename).is_file()
    with open(str(filename), "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash


def md5_file(filename: Union[str, Path]) -> str:
    return str(md5_update_from_file(filename, hashlib.md5()).hexdigest())


def md5_update_from_dir(directory: Union[str, Path], hash: Hash) -> Hash:
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        hash.update(path.name.encode())
        if path.is_file():
            hash = md5_update_from_file(path, hash)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash


def md5_dir(directory: Union[str, Path]) -> str:
    return str(md5_update_from_dir(directory, hashlib.md5()).hexdigest())

Without type annotations:

import hashlib
from pathlib import Path


def md5_update_from_file(filename, hash):
    assert Path(filename).is_file()
    with open(str(filename), "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash


def md5_file(filename):
    return md5_update_from_file(filename, hashlib.md5()).hexdigest()


def md5_update_from_dir(directory, hash):
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        hash.update(path.name.encode())
        if path.is_file():
            hash = md5_update_from_file(path, hash)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash


def md5_dir(directory):
    return md5_update_from_dir(directory, hashlib.md5()).hexdigest()

Condensed version if you only need to hash directories:

def md5_update_from_dir(directory, hash):
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        hash.update(path.name.encode())
        if path.is_file():
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash.update(chunk)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash


def md5_dir(directory):
    return md5_update_from_dir(directory, hashlib.md5()).hexdigest()

Usage: md5_hash = md5_dir("/some/directory")
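
Similarly, for a single file (the path here is just an example): file_hash = md5_file("/some/directory/report.pdf")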

IVI
danmou
  • Is it by design that you're initializing the hash with the absolute path to the files and not, say, the relative path from a root directory? – nimig18 Aug 11 '20 at 03:42
  • @nimig18 I'm doing neither; only the file names (without path) are included in the hash. – danmou Aug 12 '20 at 06:17
  • The sorting of the files is very important for this to be repeatable, as neither Path.iterdir nor os.walk guarantees any ordering; both are subject to the underlying OS implementation. However, just sorting by case-insensitive paths is not enough: because the sort is stable, if two names on Linux differ only by case, sorted can return them in a different order on different runs, depending on the specific order returned by iterdir/os.walk. The proper solution is to sort first by case and then case-insensitively. @danmou perhaps you could update this? – Matias Grioni Oct 18 '20 at 19:22
  • @MatiasGrioni It's been a while since I wrote this code and I don't remember why I did a case-insensitive sort in the first place. Shouldn't a normal case-sensitive sort do the trick? – danmou Jan 11 '21 at 17:24
  • It depends on what characteristics you want your hashing to have. There is no "natural" way to hash a folder, so there is some policy / characteristics you should define beforehand for your application and make sure the algorithm works for it. – Matias Grioni Jan 11 '21 at 18:39
  • So case-sensitive will work on linux, and windows, but it means that changing a folder or file name from "A" to "a" will change the final hash even though from one valid perspective (but not the only one), these are the same filesystem objects. – Matias Grioni Jan 11 '21 at 18:47
  • Another thing that I came across when using this is that the path.name.encode() hashing also has to be closed out. Otherwise, you don't know the exact structure and two different folder structures can hash the same. For example, (B -> (A), A) will hash the same as (B -> (A, A)). The arrow means folder name to file inside the folder. Edited: Wrong folder structure should be okay now. – Matias Grioni Jan 11 '21 at 18:51
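
To illustrate the "closing out" point from the last comment: here is a minimal sketch (not part of the original answer) that feeds arbitrary enter/leave markers into the digest, so that the two structures mentioned no longer collide. Note it still does not delimit a file's name from its contents.

import hashlib
from pathlib import Path


def md5_update_from_dir_delimited(directory, hash):
    hash.update(b"\x01")  # arbitrary "enter directory" marker
    for path in sorted(Path(directory).iterdir()):
        hash.update(path.name.encode())
        if path.is_file():
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash.update(chunk)
        elif path.is_dir():
            hash = md5_update_from_dir_delimited(path, hash)
    hash.update(b"\x02")  # arbitrary "leave directory" marker
    return hash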
10

This ActiveState recipe provides a nice function to do what you are asking. I've modified it to use an MD5 hash instead of SHA-1, as your original question asks:

def GetHashofDirs(directory, verbose=0):
  import hashlib, os
  SHAhash = hashlib.md5()
  if not os.path.exists (directory):
    return -1

  try:
    for root, dirs, files in os.walk(directory):
      for names in files:
        if verbose == 1:
          print 'Hashing', names
        filepath = os.path.join(root,names)
        try:
          f1 = open(filepath, 'rb')
        except:
          # You can't open the file for some reason
          f1.close()
          continue

        while 1:
          # Read file in as little chunks
          buf = f1.read(4096)
          if not buf : break
          SHAhash.update(hashlib.md5(buf).hexdigest())
        f1.close()

  except:
    import traceback
    # Print the stack traceback
    traceback.print_exc()
    return -2

  return SHAhash.hexdigest()

You can use it like this:

print GetHashofDirs('folder_to_hash', 1)

The output looks like this, as it hashes each file:

...
Hashing file1.cache
Hashing text.txt
Hashing library.dll
Hashing vsfile.pdb
Hashing prog.cs
5be45c5a67810b53146eaddcae08a809

The value returned from this function call is the hash; in this case, 5be45c5a67810b53146eaddcae08a809

Andy
  • What is the `import traceback` intended for? – Jul 24 '14 at 15:24
  • @begueradj, in this case, [traceback](https://docs.python.org/2/library/traceback.html#traceback.print_exc) is used to print the stack trace if an error occurs while hashing. – Andy Jul 24 '14 at 15:33
  • Just ignoring a file you couldn't open doesn't sound like the correct approach to me. Furthermore, [you cannot guarantee](http://stackoverflow.com/a/18282401/2436175), e.g. on different filesystems, that os.walk will navigate the files in the same order. – Antonio Sep 26 '16 at 11:59
  • Is this recursive? – The Quantum Physicist Jan 02 '18 at 18:48
  • @TheQuantumPhysicist yes it is because [os.walk](https://docs.python.org/3/library/os.html#os.walk) is. – Peter van Heusden Mar 04 '18 at 06:04
  • Using Python 3.5.4, this solution did not work for me, until I had changed .hexdigest() to .digest() in the SHAhas.update line. Was I doing something wrong, or was this a mistake? – Samantha May 21 '19 at 11:10
  • I know it's old, but I had to downvote this since both try..except blocks in this are broken, apart from the mentioned problems like os.walk not being stable. – Jeronimo Jun 09 '21 at 15:02
  • I also know it's old, but I have to point out that writing `except:` is never a good idea. This will invoke the 'except' block, even if normal exit was invoked with sys.exit(0). – shayst Jul 14 '22 at 12:10
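
For Python 3, update() does require bytes, so the change Samantha describes is the right fix; the line in question becomes:

SHAhash.update(hashlib.md5(buf).digest())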
7

I'm not a fan of how the recipe referenced in the answer was written. I have a much simpler version that I'm using:

import hashlib
import os


def hash_directory(path):
    digest = hashlib.sha1()

    for root, dirs, files in os.walk(path):
        for names in files:
            file_path = os.path.join(root, names)

            # Hash the path and add to the digest to account for empty files/directories
            digest.update(hashlib.sha1(file_path[len(path):].encode()).digest())

            # Per @pt12lol - if the goal is uniqueness over repeatability, this is an alternative method using 'hash'
            # digest.update(str(hash(file_path[len(path):])).encode())

            if os.path.isfile(file_path):
                with open(file_path, 'rb') as f_obj:
                    while True:
                        buf = f_obj.read(1024 * 1024)
                        if not buf:
                            break
                        digest.update(buf)

    return digest.hexdigest()

I found exceptions were usually being thrown whenever something like an alias was encountered (it shows up in os.walk(), but you can't directly open it). The os.path.isfile() check takes care of those issues.

If there is an actual file within a directory I'm attempting to hash and it can't be opened, skipping that file and continuing is not a good solution, because that affects the outcome of the hash. It's better to kill the hash attempt altogether; here, the try statement would be wrapped around the call to my hash_directory() function.

>>> try:
...   print(hash_directory('/tmp'))
... except:
...   print('Failed!')
... 
e2a075b113239c8a25c7e1e43f21e8f2f6762094
>>> 
Bryson Tyrrell
  • Good stuff, but note this hash won't change if you add an empty file or directory. – pt12lol Jun 01 '18 at 12:51
  • I added `digest.update(str(hash(file_path[len(path):])).encode())` for each `file_path` and `dir_path`. Hash ensures that this hash will be dependent on `PYTHONHASHSEED`, so faking this hash calculation will be really difficult. – pt12lol Jun 01 '18 at 13:00
  • @pt12lol I really like that idea - I hadn't considered that possibility - but I think using `hash` isn't the right solution. The result of the function will be different between Python 2 and Python 3 (I just tried this on my Mac). What about using `hashlib` again? `digest.update(hashlib.sha1(file_path.encode()).digest())` – Bryson Tyrrell Jun 02 '18 at 01:56
  • No wonder that this hash differs between Python 2 and 3, because Python 3 introduces variable `PYTHONHASHSEED` that differs between every single Python run and `hash` relies on it. I suppose you are asserting directory hash with hardcoded value and `digest` is definitely better in this case. To be honest I don't have my code dependent on hardcoded values so I didn't care about it, only cared about uniqueness. In my case `hash` is even safer than `digest`. – pt12lol Jun 02 '18 at 06:33
  • Is there any particular reason for choosing 1024*1024 as the buffer size? – Darren Oct 21 '18 at 19:41
  • @Darren I carried this over from some code I wrote to hash very large files (in excess of several GBs). 1MB seemed a sane amount. – Bryson Tyrrell Oct 22 '18 at 20:15
3

I keep seeing this code propagated through various forums.

The ActiveState recipe answer works but, as Antonio pointed out, it is not guaranteed to be repeatable across filesystems, because os.walk is not guaranteed to present the files in the same order (try it). One fix is to change

for root, dirs, files in os.walk(directory):
  for names in files:

to

for root, dirs, files in os.walk(directory):
  for names in sorted(files): 

(Yes, I'm being lazy here. This sorts the filenames only and not the directories; the same principle applies to the directories, as sketched below.)
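
For example (a sketch, not from the original recipe): since os.walk honors in-place modification of dirs, sorting it as well stabilizes which subdirectories are visited and in what order:

for root, dirs, files in os.walk(directory):
  dirs.sort()  # in-place sort; os.walk will then descend in a stable order
  for names in sorted(files):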

Toby Speight
omichael
2

Use the checksumdir package: https://pypi.org/project/checksumdir/


from checksumdir import dirhash

directory = '/path/to/directory/'
md5hash = dirhash(directory, 'md5')
Eyal
1

I have built further on Andy's answer.

The following is a Python 3 rather than Python 2 implementation. It uses SHA-1, handles some cases where encoding is needed, is linted, and includes some docstrings.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""dir_hash: Return SHA1 hash of a directory.
- Copyright (c) 2009 Stephen Akiki, 2018 Joe Flack
- MIT License (http://www.opensource.org/licenses/mit-license.php)
- http://akiscode.com/articles/sha-1directoryhash.shtml
"""
import hashlib
import os


def update_hash(running_hash, filepath, encoding=''):
    """Update running SHA1 hash, factoring in hash of given file.

    Side Effects:
        running_hash.update()
    """
    if encoding:
        file = open(filepath, 'r', encoding=encoding)
        for line in file:
            hashed_line = hashlib.sha1(line.encode(encoding))
            hex_digest = hashed_line.hexdigest().encode(encoding)
            running_hash.update(hex_digest)
        file.close()
    else:
        file = open(filepath, 'rb')
        while True:
            # Read file in as little chunks.
            buffer = file.read(4096)
            if not buffer:
                break
            # hexdigest() returns str; update() requires bytes in Python 3.
            running_hash.update(hashlib.sha1(buffer).hexdigest().encode())
        file.close()


def dir_hash(directory, verbose=False):
    """Return SHA1 hash of a directory.

    Args:
        directory (string): Path to a directory.
        verbose (bool): If True, prints progress updates.

    Raises:
        FileNotFoundError: If directory provided does not exist.

    Returns:
        string: SHA1 hash hexdigest of a directory.
    """
    sha_hash = hashlib.sha1()

    if not os.path.exists(directory):
        raise FileNotFoundError

    for root, dirs, files in os.walk(directory):
        for names in files:
            if verbose:
                print('Hashing', names)
            filepath = os.path.join(root, names)
            try:
                update_hash(running_hash=sha_hash,
                            filepath=filepath)
            except TypeError:
                update_hash(running_hash=sha_hash,
                            filepath=filepath,
                            encoding='utf-8')

    return sha_hash.hexdigest()
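
Usage follows the same pattern as Andy's version (the folder name is just an example):

print(dir_hash('folder_to_hash', verbose=True))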
Joe Flack