148

Surely there must be a way to do this easily!

I've tried the Linux command-line apps such as sha1sum and md5sum but they seem only to be able to compute hashes of individual files and output a list of hash values, one for each file.

I need to generate a single hash for the entire contents of a folder (not just the filenames).

I'd like to do something like

sha1sum /folder/of/stuff > singlehashvalue

Edit: to clarify, my files are at multiple levels in a directory tree, they're not all sitting in the same root folder.

kvantour
Ben L
  • 1
    By 'entire contents' do you mean the logical data of all files in the directory or its data along with the meta while arriving at the root hash? Since the selection criteria of your use case is quite broad, I've tried to address a few practical ones in my answer. – six-k Jan 10 '18 at 18:04
  • See also: [how do I check that two folders are the same in linux](https://stackoverflow.com/q/455061/4561887) – Gabriel Staples Apr 30 '22 at 19:10

20 Answers

186

One possible way would be:

sha1sum path/to/folder/* | sha1sum

If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be

find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum

And, finally, if you also need to take account of permissions and empty directories:

(find path/to/folder -type f -print0  | sort -z | xargs -0 sha1sum;
 find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
   xargs -0 stat -c '%n %a') \
| sha1sum

The arguments to stat cause it to print the name of each file, followed by its octal permissions. The two find commands run one after the other, causing roughly double the disk IO: the first finds all file names and checksums their contents; the second finds all file and directory names and prints each name and mode. The list of "file names and checksums", followed by "file and directory names with permissions", is then checksummed itself, yielding a single, smaller checksum.
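
Combining two suggestions from the comments below (locale-independent sorting via LC_ALL, and relative paths so the folder's location does not leak into the hash), one possible variant is this sketch, where path/to/folder is a placeholder:

(
  # LC_ALL=POSIX makes the sort order locale-independent; cd-ing into the
  # folder first means only relative paths get hashed along with the contents.
  export LC_ALL=POSIX
  cd path/to/folder || exit 1
  find . -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
)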

Nicholas Pipitone
Vatine
  • If you sort after the first sha1sum, then a LF in a filename should do no harm. – Rafał Dowgird Feb 13 '09 at 12:07
  • 1
    Edited. Sort can work on 0 delimited lists with the -z option. – Aaron Digulla Feb 13 '09 at 13:38
  • 3
    and don't forget to set LC_ALL=POSIX, so the various tools create locale independent output. – David Schmitt Feb 15 '09 at 12:28
  • 3
    I found cat | sha1sum to be considerably faster than sha1sum | sha1sum. YMMV, try each of these on your system: time find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum; time find path/to/folder -type f -print0 | sort -z | xargs -0 cat | sha1sum – Bruno Bronosky Apr 28 '11 at 17:02
  • 6
    @RichardBronosky - Let us assume we have two files, A and B. A contains "foo" and B contains "bar was here". With your method, we would not be able to separate that from two files C and D, where C contains "foobar" and D contains " was here". By hashing each file individually and then hash all "filename hash" pairs, we can see the difference. – Vatine Dec 18 '12 at 10:18
  • 2
    To make this work irrespective of the directory path (i.e. when you want to compare the hashes of two different folders), you need to use a relative path and change to the appropriate directory, because the paths are included in the final hash: `find ./folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum` – robbles Feb 14 '13 at 20:30
  • 3
    @robbles That is correct and why I did not put an initial `/` on the `path/to/folder` bit. – Vatine Feb 15 '13 at 10:58
  • You could also have your hashtool print out only the hashes, on FreeBSD for example: xargs -0 sha256 -q (Also, in your answer, you might want to draw attention to the fact that (absolute) filenames are printed out with the hashes) – hopla Feb 25 '13 at 13:47
  • @hopla Relativified paths throughout instead of just in the final example. – Vatine Feb 25 '13 at 14:12
  • 1
    Much clearer :) I've also been thinking that using relative paths is better than the -q option, because then all the file names are taken into account in the final hash as well, avoiding problems should a hash collision ever occur. – hopla Feb 25 '13 at 19:17
  • @JasonS Define "large"? You're looking at roughly linear run-time in pure data volume (consumed by `sha1sum` or equivalent hashing). You're looking at (roughly) linear performance from `find`. Sorting is probably O(n log n), with "number of files" as n. Until growth in "log n" starts being significant, time will be dominated by the disk bandwidth. Waving a hand vaguely in the air, I'd say you'd be OK for "tens to hundreds of thousands of files". At some point, the list of hashes-per-file to sort may require spilling to disk, so there's going to be a vicious cliff in the time complexity curve. – Vatine Oct 08 '15 at 08:41
  • no, I'm worried about the large command-line; xargs makes a single call to sha1sum, right? is there a limit in command-line size? – Jason S Oct 08 '15 at 13:13
  • 1
    @JasonS Ah, no, the reason for xargs is that it intelligently splits the incoming stream of "filenames to hash" from find into suitable chunks (the exact default depends on the system, but it should always be safe). – Vatine Oct 08 '15 at 15:55
  • While this command looks to work well for a certain use case, it doesn't seem to include what may be relevant details such as directory names as well as file permissions. I'm sure there's more than one way to skin the cat though. – Binary Phile Dec 17 '15 at 19:52
  • @BinaryPhile That is correct, but not what the question originally asked for. All directories with contents will have their names as part of the final hash, though (they're part of the file names). It would be possible to include the permissions, but would require (some) thought, as a plain "ls -l" would include date and time information that is (probably) not relevant. – Vatine Dec 18 '15 at 10:29
  • So this doesn't capture the permissions? – CMCDragonkai Jan 19 '16 at 02:50
  • This also doesn't capture empty directories. – CMCDragonkai Jan 19 '16 at 03:25
  • @CMCDragonkai No, it only captures file contents, making sure to respect file boundaries. If you also want to include permissions and empty directories, it would be possible to add something like `find path/to/folder \( -type f -o -type d \) -print0 | sort -z | xargs -0 stat -c "%n %a"`. Let me edit the question... – Vatine Jan 19 '16 at 09:10
  • 1
    To account for differences in sort algorithms between my Mac and RHEL 5.x server, I had to slightly modify the command: `find ./folder -type f -print0 | xargs -0 sha1sum | sort -df | sha1sum` – Mark Kreyman Jul 13 '16 at 19:13
  • Be careful with find. Running the script on `find /some/path/dir1 -type f ...` and `find /someother/path/dir2 -type f ...` will return different checksums even if the content of dir1 and dir2 is identical. You need to `cd /some/path/dir1` before calling `find . -type f ...` – Bernard Apr 15 '19 at 02:37
  • I'm having an issue where the xargs output, the list of hashes for my files, are not reliably coming out in the same order. Any idea why that might be happening? Could it be an issue with the sort command? – thinktt Jan 30 '20 at 19:31
  • @thinktt No obvious idea why. You could try replacing `xargs` with `echo` to check that the arguments are being passed through in a consistent order. Also remember that you (probably) want to ensure you're not using any localisation for sorting. – Vatine Jan 31 '20 at 11:15
  • This answer doesn't produce identical hashes for identical folders in different locations on your file system. That's a big short-coming. I explain why, and present a fix to it, as well as two bash functions I wrote: `sha256sum_dir` and `diff_dir`, in [my new answer here](https://stackoverflow.com/a/72073333/4561887). – Gabriel Staples May 01 '22 at 01:25
  • Use `shopt -s globstar`, so we can do it recursively: `sha1sum path/to/folder/** | sha1sum` – M Imam Pratama Jun 18 '22 at 17:04
  • And here's the same thing in a form of a bash function that you can immediately use: `hashfc() { (find $1 -type f -print0 | sort -z | xargs -0 sha1sum; find $1 \( -type f -o -type d \) -print0 | sort -z | xargs -0 stat -c '%n %a') | sha1sum; }` – Nullcaller Apr 28 '23 at 08:38
45
  • Use a file system intrusion detection tool like aide (a minimal workflow sketch follows this list).

  • Hash a tarball of the directory:

    tar cvf - /path/to/folder | sha1sum

  • Code something yourself, like Vatine's one-liner:

    find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
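
For the aide route mentioned above, the usual baseline-then-check workflow is roughly the following sketch (the database path and configuration vary by distribution):

    aide --init     # build a baseline database of the watched paths
    # activate the freshly built database (exact path varies by distro)
    mv /var/lib/aide/aide.db.new /var/lib/aide/aide.db
    aide --check    # later: report any changes against the baseline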

Michael Mior
David Schmitt
  • 7
    +1 for the tar solution. That is the fastest, but drop the v. verbosity only slows it down. – Bruno Bronosky Feb 05 '13 at 20:47
  • 9
    note that the tar solution assumes the files are in the same order when you compare them. Whether they are would depend on the file system the files reside on when doing the comparison. – nos Feb 25 '13 at 14:19
  • 7
    The git hash is not suitable for this purpose since file contents are only a part of its input. Even for the initial commit of a branch, the hash is affected by the commit message and the commit metadata as well, like the time of the commit. If you commit the same directory structure multiple times, you will get different hash every time, thus the resulting hash is not suitable for determining whether two directories are exact copies of each other by only sending the hash over. – Zoltan May 17 '18 at 19:11
  • 1
    @Zoltan the git hash is perfectly fine, if you use a tree hash and not a commit hash. – hobbs May 30 '19 at 02:44
  • 1
    @hobbs The answer originally stated "commit hash", which is certainly not fit for this purpose. The tree hash sounds like a much better candidate, but there could still be hidden traps. One that comes to my mind is that having the executable bit set on some files changes the tree hash. You have to issue `git config --local core.fileMode false` before committing to avoid this. I don't know whether there are any more caveats like this. – Zoltan May 30 '19 at 07:45
  • 3
    @nos: With recent versions of GNU tar, sort order can be enforced with --sort=name. – Andrew Klaassen Dec 01 '20 at 11:40
  • Note that attributes like modification time are also part of the tar archive (among other variables), so simply copying the directory will give a different hash, even on the same machine. – haansn08 Mar 26 '23 at 20:38
21

If you just want to check if something in the folder changed, I'd recommend this one:

ls -alR --full-time /folder/of/stuff | sha1sum

It will just give you a hash of the ls output, which contains the folders, sub-folders, their files, and their timestamps, sizes and permissions. Pretty much everything you would need to determine if something has changed.

Please note that this command will not generate a hash for each file, but that is why it should be faster than using find.
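
If you also want the hash to be independent of the file owner and group (see the comments for further customization ideas), the listing can be trimmed, for example with this sketch:

ls -agGR --full-time /folder/of/stuff | sha1sum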

Shumoapp
  • 3
    I'm unsure why this doesn't have more upvotes given the simplicity of the solution. Can anyone explain why this wouldn't work well? – Dave C Mar 15 '17 at 01:02
  • 1
    I suppose this isn't ideal as the generated hash will be based on file owner, date-format setup, etc. – Ryota Mar 15 '17 at 22:06
  • 1
    The ls command can be customized to output whatever you want. You can replace -l with -gG to omit the group and the owner. And you can change the date format with the --time-style option. Basically check out the ls man page and see what suits your needs. – Shumoapp Mar 16 '17 at 15:52
  • @DaveC Because it's pretty much useless. If you want to compare filenames, just compare them directly. They're not that big. – Navin Aug 18 '18 at 01:51
  • 7
    @Navin From the question it is not clear whether it is necessary to hash file contents or detect a change in a tree. Each case has its uses. Storing 45K filenames in a kernel tree, for example, is less practical than a single hash. ls -lAgGR --block-size=1 --time-style=+%s | sha1sum works great for me – yashma Aug 21 '18 at 02:26
  • Note that even for the highly regarded rsync, comparing timestamps and file sizes is sufficient by default. – Torsten Bronger Jan 27 '22 at 15:29
18

So far, the fastest way to do it is still with tar, and with several additional parameters we can also get rid of the differences caused by metadata.

To hash a directory with GNU tar, you need to make sure the paths are sorted during tar; otherwise the result is always different.

tar -C <root-dir> -cf - --sort=name <dir> | sha256sum

ignore time

If you do not care about the access time or modification time, also use something like --mtime='UTC 2019-01-01' to make sure all timestamps are the same.

ignore ownership

Usually we need to add --group=0 --owner=0 --numeric-owner to unify the owner metadata.

ignore some files

use --exclude=PATTERN

ignore permissions

It is highly recommended that you always compare the permissions.

If you really do not want to compare the permissions use:

--mode=777

This will force all file permissions to 777.

example:

$ echo a > test1/a.txt
$ echo b > test1/b.txt
$ tar -C ./ -cf - --sort=name test1 | sha256sum
e159ca984835cf4e1c9c7e939b7069d39b2fd2aa90460877f68f624458b1c95c  -
$ tar -C ./ -cf - --sort=name --mode=777 test1 | sha256sum
ef84fe411fb49bcf7967715b7854075004f1c7a7e4a57d2f3742afa4a54c40de  -
$ chmod 444 test1/a.txt
$ tar -C ./ -cf - --sort=name --mode=777 test1 | sha256sum
ef84fe411fb49bcf7967715b7854075004f1c7a7e4a57d2f3742afa4a54c40de  -
$ tar -C ./ -cf - --sort=name test1 | sha256sum
9b91430d954abb8a361b01de30f0995fb94a511c8fe1f7177ddcd475c85c65ff  -

It is known that some tar implementations do not have --sort; be sure you are using GNU tar.
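
Putting the flags above together, a single hash that ignores timestamps and ownership (but still reflects permissions) would look like:

tar -C <root-dir> -cf - --sort=name --mtime='UTC 2019-01-01' \
    --group=0 --owner=0 --numeric-owner <dir> | sha256sum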

Wang
  • 2
    This is the best answer involving GNU tar, since it ensures that file contents and directory structure are consistently compared. – Andrew Klaassen Dec 01 '20 at 14:36
  • 1
    Warning: not all versions of tar have --sort :-( – krupan Apr 13 '21 at 17:32
  • This would be a brilliant method -- if only tar wouldn't also store permissions within the tar archive: if you have identical files/dirs with different permissions, such comparison would fail. – lvd Apr 29 '23 at 18:05
  • Actually, comparing permissions is required in most cases. You do not want anyone to mess up your permission settings. If you really do not want to check permissions you can always use `--mode=777` @lvd – Wang May 03 '23 at 15:49
  • Thanks for a quick reaction @Wang! Now your method looks actually complete. – lvd May 06 '23 at 16:40
16

You can do tar -c /path/to/folder | sha1sum

davidtbernal
S.Lott
  • 24
    If you want to replicate that checksum on a different machine, tar might not be a good choice, as the format seems to have room for ambiguity and exist in many versions, so the tar on another machine might produce different output from the same files. – slowdog Jan 27 '11 at 18:42
  • 3
    slowdog's valid concerns notwithstanding, if you care about file contents, permissions, etc. but not modification time, you can add the `--mtime` option like so: `tar -c /path/to/folder --mtime="1970-01-01" | sha1sum`. – Binary Phile Dec 17 '15 at 19:44
    @S.Lott if the directory size is big, zipping it and getting the md5 of it will take more time – Kasun Siyambalapitiya Jul 24 '17 at 09:38
7

A robust and clean approach

  • First things first: don't hog the available memory! Hash a file in chunks rather than feeding in the entire file.
  • Different approaches suit different needs/purposes (all of the below, or pick whatever applies):
    • Hash only the entry name of all entries in the directory tree
    • Hash the file contents of all entries (leaving out metadata like inode number, ctime, atime, mtime, size, etc.; you get the idea)
    • For a symbolic link, its content is the referent name. Hash it, or choose to skip it
    • Whether or not to follow the symlink (resolved name) while hashing the contents of the entry
    • If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names at that level be hashed to tag this directory? This is helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file whose name changes while the rest of the contents remain the same, and they are all fairly large files
    • Handle large files well (again, mind the RAM)
    • Handle very deep directory trees (mind the open file descriptors)
    • Handle non-standard file names
    • How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Should they be hashed as well?
    • Don't update the access time of any entry while traversing, because that is a side effect and counter-productive (counter-intuitive?) for certain use cases

This is what I have off the top of my head; anyone who has spent some time working on this practically would have caught other gotchas and corner cases.

Here's a tool, very light on memory, which addresses most of these cases. It might be a bit rough around the edges but has been quite helpful.

An example usage and output of dtreetrawl.

Usage:
  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

Help Options:
  -h, --help                Show help options

Application Options:
  -t, --terse               Produce a terse output; parsable.
  -j, --json                Output as JSON
  -d, --delim=:             Character or string delimiter/separator for terse output(default ':')
  -l, --max-level=N         Do not traverse tree beyond N level(s)
  --hash                    Enable hashing(default is MD5).
  -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
  -R, --only-root-hash      Output only the root hash. Blank line if --hash is not set
  -N, --no-name-hash        Exclude path name while calculating the root checksum
  -F, --no-content-hash     Do not hash the contents of the file
  -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
  -e, --hash-dirent         Include hash of directory entries while calculating root checksum
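
For example, judging from the options above, printing only a SHA-256 root hash for a directory should look something like this (untested sketch):

dtreetrawl --hash -R -c sha256 /path/to/dir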

A snippet of human-friendly output:

...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
        Base name                    : CREDITS
        Level                        : 1
        Type                         : regular file
        Referent name                :
        File size                    : 98443 bytes
        I-node number                : 290850
        No. directory entries        : 0
        Permission (octal)           : 0644
        Link count                   : 1
        Ownership                    : UID=0, GID=0
        Preferred I/O block size     : 4096 bytes
        Blocks allocated             : 200
        Last status change           : Tue, 21 Nov 17 21:28:18 +0530
        Last file access             : Thu, 28 Dec 17 00:53:27 +0530
        Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
        Hash                         : 9f0312d130016d103aa5fc9d16a2437e

Stats for /home/lab/linux-4.14-rc8:
        Elapsed time     : 1.305767 s
        Start time       : Sun, 07 Jan 18 03:42:39 +0530
        Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
        Hash type        : md5
        Depth            : 8
        Total,
                size           : 66850916 bytes
                entries        : 12484
                directories    : 763
                regular files  : 11715
                symlinks       : 6
                block devices  : 0
                char devices   : 0
                sockets        : 0
                FIFOs/pipes    : 0
six-k
  • 1
    Can you give a brief example to get a robust and clean sha256 of a folder, maybe for a Windows folder with three subdirectories and a few files in there each? – Ferit May 10 '20 at 00:44
7

If this is a git repo and you want to ignore any files in .gitignore, you might want to use this:

git ls-files <your_directory> | xargs sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1

This is working well for me.
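
If your file names may contain spaces or other unusual characters, a NUL-separated variant of the same pipeline should be safer (a sketch using git ls-files -z):

git ls-files -z <your_directory> | xargs -0 sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1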

ndbroadbent
4

Another tool to achieve this:

http://md5deep.sourceforge.net/

As it sounds: like md5sum but recursive, plus other features.

md5deep -r {directory}
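
The same project also ships sha1deep and sha256deep. Combining -r (recursive) with -l (print relative paths), the per-file list can be collapsed into one location-independent hash, along the lines of this sketch:

sha256deep -r -l path/to/folder | sort | sha256sum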

it3xl
Jack
4

If you just want to hash the contents of the files, ignoring the filenames, then you can use

cat $FILES | md5sum

Make sure you have the files in the same order when computing the hash:

cat $(echo $FILES | sort) | md5sum

But you can't have directories in your list of files.

  • 3
    Moving the end of one file into the beginning of the file that follows it alphabetically would not affect the hash but should. A file-delimiter or file lengths would need to be included in the hash. – Jason Stangroome Mar 12 '12 at 03:35
3

You can try hashdir which is an open source command line tool written for this purpose.

hashdir /folder/of/stuff

It has several useful flags to allow you to specify the hashing algorithm, print the hashes of all children, as well as save and verify a hash.

hashdir:
  A command-line utility to checksum directories and files.

Usage:
  hashdir [options] [<item>...] [command]

Arguments:
  <item>    Directory or file to hash/check

Options:
  -t, --tree                                         Print directory tree
  -s, --save                                         Save the checksum to a file
  -i, --include-hidden-files                         Include hidden files
  -e, --skip-empty-dir                               Skip empty directories
  -a, --algorithm <md5|sha1|sha256|sha384|sha512>    The hash function to use [default: sha1]
  --version                                          Show version information
  -?, -h, --help                                     Show help and usage information

Commands:
  check <item>    Verify that the specified hash file is valid.
Anu Bandi
2

There is a Python script for that:

http://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/

If you change the names of files without changing their alphabetical order, the hash script will not detect it. But if you change the order of the files or the contents of any file, running the script will give you a different hash than before.

Kingdon
2

I had to check a whole directory for file changes, but excluding timestamps and directory ownership.

The goal is to get a sum that is identical anywhere, as long as the files are identical, including when they are hosted on other machines: nothing should matter but the files themselves, or a change in them.

md5sum * | md5sum | cut -d' ' -f1

It generates a list of hashes, one per file, then concatenates those hashes into one.

This is way faster than the tar method.

For stronger privacy in our hashes, we can use sha512sum with the same recipe.

sha512sum * | sha512sum | cut -d' ' -f1

The hashes are likewise identical anywhere using sha512sum, and there is no known way to reverse it.

NVRM
  • This seems much simpler than the accepted answer for hashing a directory. I wasn't finding the accepted answer reliable. One issue... is there a chance the hashes could come out in a different order? `sha256sum /tmp/thd-agent/* | sort` is what i'm trying for a reliable ordering, then just hashing that. – thinktt Jan 30 '20 at 19:57
  • Hi, looks like the hashes come in alphabetical order by default. What do you mean by reliable ordering? You have to organize all that by yourself. For example using associative arrays, entry + hash. Then you sort this array by entry; this gives a list of computed hashes in the sort order. I believe you can use a json object otherwise, and hash the whole object directly. – NVRM Jan 31 '20 at 01:27
  • If I understand you're saying it hashes the files in alphabetical order. That seems right. Something in the accepted answer above was giving me intermittent different orders sometimes, so I'm just trying to make sure that doesn't happen again. I'm going to stick with putting sort at the end. Seems to be working. Only issue with this method vs accepted answer I see is it doesn't deal with nested folders. In my case I don't have any folders so this works great. – thinktt Jan 31 '20 at 17:23
  • what about `ls -r | sha256sum` ? – NVRM Jan 31 '20 at 22:27
  • @NVRM tried it and it just checked for file name changes, not the file content – Gi0rgi0s Aug 14 '20 at 15:32
2

Here's a simple, short variant in Python 3 that works fine for small-sized files (e.g. a source tree or something, where every file individually can fit into RAM easily), ignoring empty directories, based on the ideas from the other solutions:

import os, hashlib

def hash_for_directory(path, hashfunc=hashlib.sha1):                                                                                            
    filenames = sorted(os.path.join(dp, fn) for dp, _, fns in os.walk(path) for fn in fns)         
    index = '\n'.join('{}={}'.format(os.path.relpath(fn, path), hashfunc(open(fn, 'rb').read()).hexdigest()) for fn in filenames)               
    return hashfunc(index.encode('utf-8')).hexdigest()                          

It works like this:

  1. Find all files in the directory recursively and sort them by name
  2. Calculate the hash (default: SHA-1) of every file (reads whole file into memory)
  3. Make a textual index with "filename=hash" lines
  4. Encode that index back into a UTF-8 byte string and hash that

You can pass in a different hash function as second parameter if SHA-1 is not your cup of tea.
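
For example, with the path as a placeholder:

print(hash_for_directory('path/to/folder'))                  # SHA-1 (the default)
print(hash_for_directory('path/to/folder', hashlib.sha256))  # any hashlib constructor works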

Thomas Perl
2

Adding multiprocessing and a progress bar to kvantour's answer

Around 30x faster (depending on CPU)

100%|██████████████████████████████████| 31378/31378 [03:03<00:00, 171.43file/s]
# to hash without permissions
find . -type f -print0 | sort -z | xargs -P $(nproc --all) -0 sha1sum | tqdm --unit file --total $(find . -type f | wc -l) | sort | awk '{ print $1 }' | sha1sum
# to hash permissions
(find . -type f -print0  | sort -z | xargs -P $(nproc --all) -0 sha1sum | sort | awk '{ print $1 }'; 
  find . \( -type f -o -type d \) -print0 | sort -z | xargs -P $(nproc --all) -0 stat -c '%n %a') | \
  sort | sha1sum | awk '{ print $1 }'

Make sure tqdm is installed: pip install tqdm, or check the documentation

awk removes the file path, so that a different parent directory or path won't affect the hash

FarisHijazi
  • 1
    this needs a | sort before the last sha1sum to get consistent results (unless tqdm takes care of that? I didn't test with tqdm) – krupan Apr 13 '21 at 17:44
  • that's correct, I just added that without seeing your comment, and now I wish I saw yours before. – FarisHijazi Oct 25 '21 at 17:17
2

Quick summary: how to hash the contents of an entire folder, or compare two folders for equality

# 1. How to get a sha256 hash over all file contents in a folder, including
# hashing over the relative file paths within that folder to check the
# filenames themselves (get this bash function below).
sha256sum_dir "path/to/folder"

# 2. How to quickly compare two folders (get the `diff_dir` bash function below)
diff_dir "path/to/folder1" "path/to/folder2"
# OR:
diff -r -q "path/to/folder1" "path/to/folder2"

The "one liners"

Do this instead of the main answer, to get a single hash for all non-directory file contents within an entire folder, no matter where the folder is located:

This is a "1-line" command. Copy and paste the whole thing to run it all at once:

# This one works, but don't use it, because its hash output does NOT
# match that of my `sha256sum_dir` function. I recommend you use
# the "1-liner" just below, therefore, instead.

time ( \
    starting_dir="$(pwd)" \
    && target_dir="path/to/folder" \
    && cd "$target_dir" \
    && find . -not -type d -print0 | sort -zV \
    | xargs -0 sha256sum | sha256sum; \
    cd "$starting_dir"
)

However, that produces a slightly different hash than the one my sha256sum_dir bash function, presented below, produces. So, to get the output hash to exactly match the output from my sha256sum_dir function, do this instead:

# Use this one, as its output matches that of my `sha256sum_dir`
# function exactly.

all_hashes_str="$( \
    starting_dir="$(pwd)" \
    && target_dir="path/to/folder" \
    && cd "$target_dir" \
    && find . -not -type d -print0 | sort -zV | xargs -0 sha256sum \
    )"; \
    cd "$starting_dir"; \
    printf "%s" "$all_hashes_str" | sha256sum

For more on why the main answer doesn't produce identical hashes for identical folders in different locations, see further below.

[My preferred method] Here are some bash functions I wrote: sha256sum_dir and diff_dir

Place the following functions in your ~/.bashrc file or in your ~/.bash_aliases file, assuming your ~/.bashrc file sources the ~/.bash_aliases file like this:

if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi

You can find both of the functions below in my personal ~/.bash_aliases file in my eRCaGuy_dotfiles repo.

Here is the sha256sum_dir function, which obtains a total "directory" hash of all files in the directory:

# Take the sha256sum of all files in an entire dir, and then sha256sum that
# entire output to obtain a _single_ sha256sum which represents the _entire_
# dir.
# See:
# 1. [my answer] https://stackoverflow.com/a/72070772/4561887
sha256sum_dir() {
    return_code="$RETURN_CODE_SUCCESS"
    if [ "$#" -eq 0 ]; then
        echo "ERROR: too few arguments."
        return_code="$RETURN_CODE_ERROR"
    fi
    # Print help string if requested
    if [ "$#" -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
        # Help string
        echo "Obtain a sha256sum of all files in a directory."
        echo "Usage:  ${FUNCNAME[0]} [-h|--help] <dir>"
        return "$return_code"
    fi

    starting_dir="$(pwd)"
    target_dir="$1"
    cd "$target_dir"

    # See my answer: https://stackoverflow.com/a/72070772/4561887
    filenames="$(find . -not -type d | sort -V)"
    IFS=$'\n' read -r -d '' -a filenames_array <<< "$filenames"
    time all_hashes_str="$(sha256sum "${filenames_array[@]}")"
    cd "$starting_dir"

    echo ""
    echo "Note: you may now call:"
    echo "1. 'printf \"%s\n\" \"\$all_hashes_str\"' to view the individual" \
         "hashes of each file in the dir. Or:"
    echo "2. 'printf \"%s\" \"\$all_hashes_str\" | sha256sum' to see that" \
         "the hash of that output is what we are using as the final hash" \
         "for the entire dir."
    echo ""
    printf "%s" "$all_hashes_str" | sha256sum | awk '{ print $1 }'
    return "$?"
}
# Note: I prefix this with my initials to find my custom functions easier
alias gs_sha256sum_dir="sha256sum_dir"

Assuming you just want to compare two directories for equality, you can use diff -r -q "dir1" "dir2" instead, which I wrapped in this diff_dir command. I learned about the diff command to compare entire folders here: how do I check that two folders are the same in linux.

# Compare dir1 against dir2 to see if they are equal or if they differ.
# See:
# 1. How to `diff` two dirs: https://stackoverflow.com/a/16404554/4561887
diff_dir() {
    return_code="$RETURN_CODE_SUCCESS"
    if [ "$#" -eq 0 ]; then
        echo "ERROR: too few arguments."
        return_code="$RETURN_CODE_ERROR"
    fi
    # Print help string if requested
    if [ "$#" -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
        echo "Compare (diff) two directories to see if dir1 contains the same" \
             "content as dir2."
        echo "NB: the output will be **empty** if both directories match!"
        echo "Usage:  ${FUNCNAME[0]} [-h|--help] <dir1> <dir2>"
        return "$return_code"
    fi

    dir1="$1"
    dir2="$2"
    time diff -r -q "$dir1" "$dir2"
    return_code="$?"
    if [ "$return_code" -eq 0 ]; then
        echo -e "\nDirectories match!"
    fi

    # echo "$return_code"
    return "$return_code"
}
# Note: I prefix this with my initials to find my custom functions easier
alias gs_diff_dir="diff_dir"

Here is the output of my sha256sum_dir command on my ~/temp2 dir (which I describe just below so you can reproduce it and test this yourself). You can see the total folder hash is b86c66bcf2b033f65451e8c225425f315e618be961351992b7c7681c3822f6a3 in this case:

$ gs_sha256sum_dir ~/temp2

real    0m0.007s
user    0m0.000s
sys 0m0.007s

Note: you may now call:
1. 'printf "%s\n" "$all_hashes_str"' to view the individual hashes of each 
file in the dir. Or:
2. 'printf "%s" "$all_hashes_str" | sha256sum' to see that the hash of that 
output is what we are using as the final hash for the entire dir.

b86c66bcf2b033f65451e8c225425f315e618be961351992b7c7681c3822f6a3

Here is the cmd and output of diff_dir to compare two dirs for equality. This is checking that copying an entire directory to my SD card just now worked correctly. I made the output indicate Directories match! whenever that is the case!:

$ gs_diff_dir "path/to/sd/card/tempdir" "/home/gabriel/tempdir"

real    0m0.113s
user    0m0.037s
sys 0m0.077s

Directories match!

Why the main answer doesn't produce identical hashes for identical folders in different locations

I tried the most-upvoted answer here, and it doesn't work quite right as-is. It needs a little tweaking. It doesn't work quite right because the hash changes based on the folder-of-interest's base path! That means that an identical copy of some folder will have a different hash than the folder it was copied from even if the two folders are perfect matches and contain exactly the same content! That kind of defeats the purpose of taking a hash of the folder if the hashes of two identical folders differ! Let me explain:

Assume I have a folder named temp2 at ~/temp2. It contains file1.txt, file2.txt, and file3.txt. file1.txt contains the letter a followed by a return, file2.txt contains a letter b followed by a return, and file3.txt contains a letter c followed by a return.

If I run find /home/gabriel/temp2, I get:

$ find /home/gabriel/temp2
/home/gabriel/temp2
/home/gabriel/temp2/file3.txt
/home/gabriel/temp2/file1.txt
/home/gabriel/temp2/file2.txt

If I forward that to sha256sum (in place of sha1sum) in the same pattern as the main answer states, I get this. Notice it has the full path after each hash, which is not what we want:

$ find /home/gabriel/temp2 -type f -print0 | sort -z | xargs -0 sha256sum
87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7  /home/gabriel/temp2/file1.txt
0263829989b6fd954f72baaf2fc64bc2e2f01d692d4de72986ea808f6e99813f  /home/gabriel/temp2/file2.txt
a3a5e715f0cc574a73c3f9bebb6bc24f32ffd5b67b387244c2c909da779a1478  /home/gabriel/temp2/file3.txt

If you then pipe that output string above to sha256sum again, it hashes the file hashes with their full file paths, which is not what we want! The file hashes may match in a folder and in a copy of that folder exactly, but the absolute paths do NOT match exactly, so they will produce different final hashes since we are hashing over the full file paths as part of our single, final hash!

Instead, what we want is the relative file path next to each hash. To do that, you must first cd into the folder of interest, and then run the hash command over all files therein, like this:

cd "/home/gabriel/temp2" && find . -type f -print0 | sort -z | xargs -0 sha256sum

Now, I get this. Notice the file paths are all relative now, which is what I want!:

$ cd "/home/gabriel/temp2" && find . -type f -print0 | sort -z | xargs -0 sha256sum
87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7  ./file1.txt
0263829989b6fd954f72baaf2fc64bc2e2f01d692d4de72986ea808f6e99813f  ./file2.txt
a3a5e715f0cc574a73c3f9bebb6bc24f32ffd5b67b387244c2c909da779a1478  ./file3.txt

Good. Now, if I hash that entire output string, since the file paths are all relative in it, the final hash will match exactly for a folder and its copy! In this way, we are hashing over the file contents and the file names within the directory of interest, to get a different hash for a given folder if either the file contents are different or the filenames are different, or both.

Gabriel Staples
1

I've written a Groovy script to do this:

import java.security.MessageDigest

public static String generateDigest(File file, String digest, int paddedLength){
    MessageDigest md = MessageDigest.getInstance(digest)
    md.reset()
    def files = []
    def directories = []

    if(file.isDirectory()){
        file.eachFileRecurse(){sf ->
            if(sf.isFile()){
                files.add(sf)
            }
            else{
                directories.add(file.toURI().relativize(sf.toURI()).toString())
            }
        }
    }
    else if(file.isFile()){
        files.add(file)
    }

    files.sort({a, b -> return a.getAbsolutePath() <=> b.getAbsolutePath()})
    directories.sort()

    files.each(){f ->
        println file.toURI().relativize(f.toURI()).toString()
        f.withInputStream(){is ->
            byte[] buffer = new byte[8192]
            int read = 0
            while((read = is.read(buffer)) > 0){
                md.update(buffer, 0, read)
            }
        }
    }

    directories.each(){d ->
        println d
        md.update(d.getBytes())
    }

    byte[] digestBytes = md.digest()
    BigInteger bigInt = new BigInteger(1, digestBytes)
    return bigInt.toString(16).padLeft(paddedLength, '0')
}

println "\n${generateDigest(new File(args[0]), 'SHA-256', 64)}"

You can customize the usage to avoid printing each file, change the message digest, take out directory hashing, etc. I've tested it against the NIST test data and it works as expected. http://www.nsrl.nist.gov/testdata/

gary-macbook:Scripts garypaduana$ groovy dirHash.groovy /Users/garypaduana/.config
.DS_Store
configstore/bower-github.yml
configstore/insight-bower.json
configstore/update-notifier-bower.json
filezilla/filezilla.xml
filezilla/layout.xml
filezilla/lockfile
filezilla/queue.sqlite3
filezilla/recentservers.xml
filezilla/sitemanager.xml
gtk-2.0/gtkfilechooser.ini
a/
configstore/
filezilla/
gtk-2.0/
lftp/
menus/
menus/applications-merged/

79de5e583734ca40ff651a3d9a54d106b52e94f1f8c2cd7133ca3bbddc0c6758
haventchecked
1

Try doing it in two steps:

  1. create a file with hashes for all files in a folder
  2. hash this file

Like so:

# for FILE in `find /folder/of/stuff -type f | sort`; do sha1sum $FILE >> hashes; done
# sha1sum hashes

Or do it all at once:

# cat `find /folder/of/stuff -type f | sort` | sha1sum
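
To be safe with spaces or other whitespace in file names (see the comment below), the NUL-separated idiom from the accepted answer carries over to both steps, roughly:

# find /folder/of/stuff -type f -print0 | sort -z | xargs -0 sha1sum > hashes
# sha1sum hashes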
Joao da Silva
  • `for F in 'find ...' ...` doesn't work when you have spaces in names (which you always do nowadays). – mivk Apr 10 '12 at 10:38
1

I would pipe the results for individual files through sort (to prevent a mere reordering of the files from changing the hash) into md5sum or sha1sum, whichever you choose.
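
For example, something like this sketch (the same pattern appears with sha256sum in a later answer):

find path/to/folder -type f -exec sha1sum {} + | sort | sha1sum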

Rafał Dowgird
0

You could use sha1sum to generate the list of hash values and then sha1sum that list again; it depends on what exactly you want to accomplish.
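
In its simplest form, for a flat folder, that might look like this sketch (the same idea as the accepted answer):

sha1sum /folder/of/stuff/* | sha1sum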

Ronny Vindenes
0

How to hash all files in an entire directory, including the filenames as well as their contents

Assuming you are trying to compare a folder and all its contents to ensure it was copied correctly from one computer to another, for instance, you can do it as follows. Let's assume the folder is named mydir and is at path /home/gabriel/mydir on computer 1, and at /home/gabriel/dev/repos/mydir on computer 2.

# 1. First, cd to the dir in which the dir of interest is found. This is
# important! If you don't do this, then the paths output by find will differ
# between the two computers since the absolute paths to `mydir` differ. We are
# going to hash the paths too, not just the file contents, so this matters. 
cd /home/gabriel            # on computer 1
cd /home/gabriel/dev/repos  # on computer 2

# 2. hash all files inside `mydir`, then hash the list of all hashes and their
# respective file paths. This obtains one single final hash. Sorting is
# necessary by piping to `sort` to ensure we get a consistent file order in
# order to ensure a consistent final hash result.
find mydir -type f -exec sha256sum {} + | sort | sha256sum

# Optionally pipe that output to awk to filter in on just the hash (first field
# in the output)
find mydir -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'

That's it!

To see the intermediary list of file hashes, for learning's sake, just run this:

find mydir -type f -exec sha256sum {} + | sort

Note that the above commands ignore empty directories, file permissions, timestamps of when files were last edited, etc. For most cases though that's ok.

Example

Here is a real run and actual output. I wanted to ensure my eclipse-workspace folder was properly copied from one computer to another. As you can see, the time command tells me it took 11.790 seconds:

$ time find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4  -

real    0m11.790s
user    0m11.372s
sys 0m0.432s

The hash I care about is: 8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4

If piping to awk and excluding time, I get:

$ find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4

Be sure to check find for errors in the printed stderr output, as a hash will be produced even if find fails.

Hashing my whole eclipse-workspace dir in only 12 seconds is impressive considering it contains 6480 files, as shown by this:

find eclipse-workspace -type f | wc -l

...and is 3.6 GB in size, as shown by this:

du -sh eclipse-workspace

See also

  1. My other answer here, where I use the above info.: how do I check that two folders are the same in linux

Other credit: I had a chat with ChatGPT to learn some of the pieces above. All work and text above, however, was written by me, tested by me, and verified by me.

Gabriel Staples