151

I need to calculate a summary MD5 checksum for all files of a particular type (*.py for example) placed under a directory and all sub-directories.

What is the best way to do that?


The proposed solutions are very nice, but this is not exactly what I need. I'm looking for a solution to get a single summary checksum which will uniquely identify the directory as a whole - including content of all its subdirectories.

Peter Mortensen
victorz
  • 3
    Seems like a superuser question to me. – Noldorin Nov 01 '09 at 14:52
  • 12
    Note that checksums don't *uniquely* identify anything. – Hosam Aly Nov 01 '09 at 15:58
  • 1
    Why would you have two directory trees that may or may not be "the same" that you want to uniquely identify? Does file create/modify/access time matter? Is version control what you really need? – jmucchiello Nov 01 '09 at 22:36
  • What really matters in my case is that the content of the whole directory tree is the same, which AFAIK means the following: 1) the content of every file under the directory tree is unchanged 2) no new file was added to the directory tree 3) no file was deleted – victorz Nov 03 '09 at 11:18
  • Related post: [How do I get the MD5 sum of a directory's contents as one sum?](https://unix.stackexchange.com/questions/35832) – zx8754 Oct 12 '17 at 09:28
  • Take a look at [this](http://stahlforce.com/dev/index.php?tool=md5list) and [this](http://osdir.com/ml/security.basics/2002-06/msg00599.html) for a more detailed explanation. – luvieere Nov 01 '09 at 14:06

16 Answers

171

Create a tar archive file on the fly and pipe that to md5sum:

tar c dir | md5sum

This produces a single MD5 hash value that should be unique to your file and sub-directory setup. No files are created on disk.

Peter Mortensen
ire_and_curses
  • 25
    @CharlesB with a single check-sum you never know which file is different. The question was about a single check-sum for a directory. – Hawken Jul 01 '12 at 16:11
  • 24
    `ls -alR dir | md5sum` . This is even better no compression just a read. It is unique because the content contains the mod time and size of file ;) – Sid Oct 03 '12 at 01:14
  • @PuttySidDahari Nice solution. Its not truly finding all differences since its not computing md5s on the files themselves...but very fast and definitely 'good enough' for me! – dsummersl Oct 19 '12 at 13:07
  • Doesn't the tar utility create a huge amount of cpu overhead because of the compression ?! I think the above accepted answer is more efficient... – Motsel Nov 07 '12 at 09:57
  • 17
    @Daps0l - there is no compression in my command. You need to add `z` for gzip, or `j` for bzip2. I've done neither. – ire_and_curses Nov 07 '12 at 18:19
  • 10
    Take care that doing this would integrate the timestamp of the files and other stuff in the checksum computation, not only the content of the files – Michael Zilbermann May 16 '13 at 08:03
  • 3
    It didn't work for me, I think mainly because I copied files to external HDD, so their metadata attrs changed, and tar packs it also. Maybe tar has some options to skip metadata. – Konstantine Rybnikov Sep 20 '13 at 10:37
  • 16
    This is cute, but it doesn't really work. There's no guarantee that `tar`ing the same set of files twice, or on two different computers, will yield the same exact result. – fletom Apr 08 '14 at 22:43
  • 6
    The problem is that the other directory you're comparing to could be on another machine with another filesystem, and tar has no guarantee about the ordering of how it bundles files. So you can have all the files individually have the correct checksum, but the tar|md5 computation will differ. – brak2718 Apr 18 '14 at 02:57
  • Creating tar may differ for the folder due to its timestamp each time its generated and it won't be unique – alper Dec 18 '20 at 00:40
  • Have fun with 10TB folder – Boris Ivanov May 28 '23 at 12:34
165
find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | awk '{print $1}' | sort | md5sum

The find command lists all the files that end in .py. The MD5 hash value is computed for each .py file. AWK is used to pick off the MD5 hash values (ignoring the filenames, which may not be unique). The MD5 hash values are sorted. The MD5 hash value of this sorted list is then returned.

I've tested this by copying a test directory:

rsync -a ~/pybin/ ~/pybin2/

I renamed some of the files in ~/pybin2.

The find...md5sum command returns the same output for both directories.

2bcf49a4d19ef9abd284311108d626f1  -

To take the file layout (paths) into account, so that the checksum changes if a file is renamed or moved, drop the awk step and keep the sort, so the result does not depend on the order in which find lists the files:

find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | sort | md5sum

On macOS with md5:

find /path/to/dir/ -type f -name "*.py" -exec md5 {} + | sort | md5
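The difference between the content-only and the layout-aware variants can be seen in a small sketch. The function names `content_sum` and `layout_sum` and the scratch directory are made up for illustration, and GNU md5sum is assumed. The layout-aware variant runs from inside the directory so that the recorded paths are relative and two copies in different locations can be compared.

```shell
#!/bin/sh
# Content-only checksum: per-file hashes, sorted, then hashed again.
content_sum() {
  find "$1" -type f -name "*.py" -exec md5sum {} + \
    | awk '{print $1}' | sort | md5sum | awk '{print $1}'
}

# Layout-aware checksum: keep the (relative) paths in the hashed list.
layout_sum() {
  (cd "$1" && find . -type f -name "*.py" -exec md5sum {} + \
    | LC_ALL=C sort | md5sum | awk '{print $1}')
}

# Two scratch trees with identical content but one renamed file.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/pkg" "$tmp/b/pkg"
printf 'print("hi")\n' > "$tmp/a/pkg/main.py"
printf 'print("hi")\n' > "$tmp/b/pkg/renamed.py"

echo "content-only: $(content_sum "$tmp/a") $(content_sum "$tmp/b")"  # the two values match
echo "layout-aware: $(layout_sum "$tmp/a") $(layout_sum "$tmp/b")"    # the two values differ
```

Which variant is appropriate depends on whether a rename should count as a change.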
Eneko Alonso
unutbu
  • 26
    Note that the same checksum will be generated if a file gets renamed. So this doesn't truly fit a "checksum which will uniquely identify the directory as a whole" if you consider file layout part of the signature. – Valentin Milea Jan 27 '12 at 13:46
  • 1
    you could slightly change the command-line to prefix each file checksum with the name of the file (or even better, the relative path of the file from /path/to/dir/) so it is taken into account in the final checksum. – Michael Zilbermann May 16 '13 at 08:02
  • 4
    @zim2001: Yes, it could be altered, but as I understood the problem (especially due to the OP's comment under the question), the OP wanted any two directories to be considered equal if the *contents* of the files were identical regardless of filename or even relative path. – unutbu May 16 '13 at 09:31
  • @unutbu : I know; i was reacting to the previous note, from Valentin Milea. – Michael Zilbermann May 22 '13 at 09:02
  • @ValentinMilea just remove the `awk ...` part if you consider layout part of signature. – segfault Jul 01 '13 at 21:17
  • Is there a syntax error in your answer? I had to enclose the -name pattern in single quotes in order to get it to work. – silvernightstar Oct 11 '13 at 11:18
  • @silvernightstar: For me (on Ubuntu/bash) it works either way but you are right; I probably should put quotes around it. – unutbu Oct 11 '13 at 12:40
  • @unutbu Without the quotes it expands *.py and thus breaks if any .py files are in the current directory that you are running the command from, that's why it needs to/should always be quoted. – Adrian Frühwirth Oct 11 '13 at 12:58
  • Sorry I know this is a year old, but why sort the initial md5sums? – user3388884 Jun 27 '14 at 19:33
  • @user3388884: Consider two different machines with the same directory contents, but say, different names. The files may be listed in a different order. So the md5sums will be generated in a different order. We want to consider the two directories as equivalent, so we must come up with a canonical order for the md5sums before we hash the md5sums. – unutbu Jun 27 '14 at 19:36
  • ok i see... but you're sorting the md5sums, not the files... but i guess it wouldn't matter actually – user3388884 Jun 30 '14 at 16:41
  • My version of this command is: `find /path/to/dir/ -type f -exec md5sum {} + | sort | md5sum | cut -c1-32`. This takes into account the individual hashes and their paths. The checksum won't change unless a file is renamed, removed or modified. Just always use the same path. – Krzysiek Apr 30 '15 at 07:30
  • Best answer can be found : https://unix.stackexchange.com/questions/35832/how-do-i-get-the-md5-sum-of-a-directorys-contents-as-one-sum – MUY Belgium Oct 16 '17 at 13:58
  • @ValentinMilea What change can modify the checksum of a directory? – ado sar Jul 05 '23 at 22:19
53

ire_and_curses's suggestion of using tar c <dir> has some issues:

  • tar processes directory entries in the order in which they are stored in the filesystem, and at the time of writing there was no way to change this order (GNU tar 1.28 and later add a --sort=name option). This can yield completely different results if you have the "same" directory in different places, and with older tar versions I know of no way to fix this (tar cannot "sort" its input files in a particular order).
  • I usually care about whether groupid and ownerid numbers are the same, not necessarily whether the string representation of group/owner are the same. This is in line with what for example rsync -a --delete does: it synchronizes virtually everything (minus xattrs and acls), but it will sync owner and group based on their ID, not on string representation. So if you synced to a different system that doesn't necessarily have the same users/groups, you should add the --numeric-owner flag to tar
  • tar will include the filename of the directory you're checking itself, just something to be aware of.

As long as there is no fix for the first problem (or unless you're sure it does not affect you), I would not use this approach.
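On systems with a recent GNU tar, the ordering and ownership issues above can be worked around with options. The following is only a sketch under that assumption (GNU tar 1.28 or later for `--sort`); the function name `tar_sum` is made up for illustration. File modes and names still enter the checksum.

```shell
#!/bin/sh
# Assumes GNU tar >= 1.28 and GNU md5sum.
tar_sum() {
  # -C ...  .       : archive relative to the directory, so its own name is left out
  # --sort=name     : fixed member order, independent of the filesystem
  # --owner/--group : pin ownership to numeric 0 instead of local user names
  # --mtime=@0      : pin all timestamps to the epoch
  LC_ALL=C tar -C "$1" --sort=name --owner=0 --group=0 --numeric-owner \
      --mtime=@0 -cf - . | md5sum | awk '{print $1}'
}
```

Usage would be `tar_sum /path/to/dir`. With BSD tar or busybox tar these options are not available in this form, so the find-based approach remains the more portable one.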

The proposed find-based solutions are also no good because they only include files, not directories, which becomes an issue if the checksum should take empty directories into account.

Finally, most suggested solutions don't sort consistently, because the collation might be different across systems.

This is the solution I came up with:

dir=<mydir>; (find "$dir" -type f -exec md5sum {} +; find "$dir" -type d) | LC_ALL=C sort | md5sum

Notes about this solution:

  • The LC_ALL=C is to ensure reliable sorting order across systems
  • This doesn't differentiate between a directory "named\nwithanewline" and two directories "named" and "withanewline", but the chance of that occurring seems very unlikely. One usually fixes this with a -print0 flag for find, but since there's other stuff going on here, I can only see solutions that would make the command more complicated than it's worth.
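For systems with GNU findutils and coreutils, a NUL-separated variant is not much longer. This is a sketch, not a drop-in replacement: the function name `tree_sum` is made up, and `-print0`, `sort -z` and `xargs -0 -r` are assumed to be available. It also changes into the directory first, so the recorded paths are relative and copies in different locations give the same result.

```shell
#!/bin/sh
# Assumes GNU find, sort, xargs and md5sum.
tree_sum() {
  (
    cd "$1" || exit 1
    # Per-file hashes, visited in a locale-independent order.
    find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r md5sum
    # Directory names, NUL-terminated, so empty directories count too
    # and a newline inside a name cannot be mistaken for a separator.
    find . -type d -print0 | LC_ALL=C sort -z
  ) | md5sum | awk '{print $1}'
}
```

Usage would be `tree_sum /path/to/dir`.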

PS: one of my systems uses a limited busybox find which does not support -exec nor -print0 flags, and also it appends '/' to denote directories, while findutils find doesn't seem to, so for this machine I need to run:

dir=<mydir>; (find "$dir" -type f | while IFS= read -r f; do md5sum "$f"; done; find "$dir" -type d | sed 's#/$##') | LC_ALL=C sort | md5sum

Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.

Peter Mortensen
Dieter_be
  • 1
    +1: Very interesting! Are you saying that the order might differ between different filesystem types, or within the same filesystem? – ire_and_curses Nov 01 '11 at 17:10
  • 2
    both. it just depends on the order of the directory entries within each directory. AFAIK directory entries (in the filesystem) are just created in the order in which you "create files in the directory". A simple example: $ mkdir a; touch a/file-1; touch a/file-2 $ mkdir b; touch b/file-2; touch b/file-1 $ (cd a; tar -c . | md5sum) fb29e7af140aeea5a2647974f7cdec77 - $ (cd b; tar -c . | md5sum) a3a39358158a87059b9f111ccffa1023 - – Dieter_be Nov 14 '11 at 13:01
  • I'd rather replace the while-stuff with a plain xargs so -P for parallel processing is possible. This also requires an additional sort step for the second column because parallel md5sum is with no repeatable order. `find "$dir" -type f -print0 | xargs -P 6 -r0 md5sum | sort -k2` – motzmann Nov 15 '20 at 13:32
18

If you only care about files and not empty directories, this works nicely:

find /path -type f | sort -u | xargs cat | md5sum
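One property of this approach is worth knowing: the files are concatenated before hashing, so neither file names nor file boundaries enter the checksum. A minimal sketch with scratch directories (the function name `cat_sum` and the file names are made up for illustration):

```shell
#!/bin/sh
# Same pipeline as above, wrapped in a function; prints only the hash.
cat_sum() {
  find "$1" -type f | sort -u | xargs cat | md5sum | awk '{print $1}'
}

tmp=$(mktemp -d)
mkdir "$tmp/x" "$tmp/y"
printf 'ab' > "$tmp/x/1.txt"; printf 'c'  > "$tmp/x/2.txt"
printf 'a'  > "$tmp/y/1.txt"; printf 'bc' > "$tmp/y/2.txt"

# Both trees concatenate to "abc", so the two checksums are identical
# although no individual file has the same content.
cat_sum "$tmp/x"
cat_sum "$tmp/y"
```

If that matters, hashing each file separately with md5sum and then hashing the resulting list avoids the ambiguity.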
tesujimath
  • Why is `cat` required? Will it work for files with spaces in their name? – Peter Mortensen Nov 06 '21 at 23:25
  • OK, *tesujimath* seems to have left the building (*"Last seen more than 2 years ago"*). Perhaps somebody else can chime in? – Peter Mortensen Nov 06 '21 at 23:26
  • 2
    If you don't `cat` the files the input to `md5sum` will be the output of `find` which is a list of file names (and paths) **not** the content of those files. – Abid H. Mujtaba Nov 18 '21 at 02:07
  • Note: I thought about omitting `sort -u` but we need it because otherwise order of the files can be different therefore the checksum as well. – Putnik Jan 20 '22 at 07:50
  • Use `find /path -type f -print0 | sort -u -z | xargs --null cat | md5sum` to also handle filenames with spaces. – Victor Klos Aug 09 '23 at 12:05
11

A solution which worked best for me:

find "$path" -type f -print0 | sort -z | xargs -r0 md5sum | md5sum

Reason why it worked best for me:

  1. handles file names containing spaces
  2. Ignores filesystem meta-data
  3. Detects if file has been renamed
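Because the final checksum is computed over md5sum's per-file output, that intermediate list can also be kept and used later to find out which file changed. The following is a sketch with made-up helper names (`write_manifest`, `check_manifest`), assuming GNU find, sort, xargs and md5sum; the manifest path must be absolute because the check runs from inside the directory.

```shell
#!/bin/sh
# Write one "hash  ./relative/path" line per file, in a stable order.
write_manifest() {
  (cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r md5sum)
}

# Re-check a tree against a saved manifest ($2, absolute path).
# Prints only the files that fail and returns non-zero if any do.
check_manifest() {
  (cd "$1" && md5sum --quiet -c "$2")
}

# Example (paths are placeholders):
#   write_manifest /path/to/dir > /tmp/dir.md5
#   md5sum < /tmp/dir.md5                      # the single summary checksum
#   check_manifest /path/to/dir /tmp/dir.md5   # later: which files changed?
```

Note that checking a manifest reports modified and missing files but not newly added ones; comparing the summary checksum covers that case.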

Issues with other answers:

Filesystem meta-data is not ignored for:

tar cf - "$path" | md5sum

Does not handle file names containing spaces nor detects if file has been renamed:

find /path -type f | sort -u | xargs cat | md5sum
Tiago Lopo
  • What's the `-r0` option to `xargs`? I see a `-r` option for not running the command if the input contains no non-blanks, but what's with the `0`? – Stack Underflow Feb 01 '22 at 17:41
  • In case the path contains spaces, see the `-print0` for find. we also have `-0` for xargs and `-z` for sort, basically it will replace spaces by null characters. – Tiago Lopo Feb 02 '22 at 12:37
  • 1
    Oh, of course! You are combing the two different options `-r` and `-0`. I was thinking a single `-r0` option. Thanks. – Stack Underflow Feb 02 '22 at 19:16
  • No, `-print0`, `-z` and `--null` change the line endings to NULs, in order to leave spaces intact. – Victor Klos Aug 09 '23 at 12:06
  • hi @VictorKlos, we never open the file, we pass the filename to md5sum – Tiago Lopo Aug 09 '23 at 13:40
10

For the sake of completeness, there's md5deep(1); it's not directly applicable due to *.py filter requirement but should do fine together with find(1).

Michael Shigorin
  • What parameters would I use if I only wanted to calculate the md5 checksum of a directory? – Gabriel Fair Feb 25 '17 at 19:20
  • 1
    What is it supposed to do? Can you [elaborate](https://stackoverflow.com/posts/14696411/edit) in your answer (without an elaboration, this is not much more than a link-only answer)? (But ***without*** "Edit:", "Update:", or similar - the question/answer should appear as if it was written today.) – Peter Mortensen Nov 06 '21 at 23:19
  • Peter, I can't as I haven't used it much myself but rather chosen for inclusion in ALT Rescue images back in the day when I was the guy maintaining those; a simple link like that has helped me on SO so more than once... thank you for the query, anyways (only seen it today). – Michael Shigorin Jan 26 '22 at 17:20
4

If you want one MD5 hash value spanning the whole directory, I would do something like

cat *.py | md5sum
Peter Mortensen
Ramon
3

Checksum all files, including both content and their filenames

grep -ar -e . /your/dir | md5sum | cut -c-32

Same as above, but only including *.py files

grep -ar -e . --include="*.py" /your/dir | md5sum | cut -c-32

You can also follow symlinks if you want

grep -aR -e . /your/dir | md5sum | cut -c-32

Other options you could consider using with grep

-s, --no-messages         suppress error messages
-D, --devices=ACTION      how to handle devices, FIFOs and sockets;
-Z, --null                print 0 byte after FILE name
-U, --binary              do not strip CR characters at EOL (MSDOS/Windows)
moander
  • How does that overcome the [problems with a defined sort order](https://stackoverflow.com/questions/1657232/how-can-i-calculate-an-md5-checksum-of-a-directory/7838345#7838345)? – Peter Mortensen Nov 06 '21 at 23:47
2

GNU find

find /path -type f -name "*.py" -exec md5sum "{}" +;
ghostdog74
2

Technically you only need to run ls -lR *.py | md5sum. Unless you are worried about someone modifying the files, touching them back to their original dates and never changing the files' sizes, the output from ls should tell you if a file has changed. My Unix-fu is weak, so you might need some more command line parameters to get the create time and modification time to print. ls will also tell you if permissions on the files have changed (and I'm sure there are switches to turn that off if you don't care about that).
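The same metadata-only idea can be written with find, which recurses into subdirectories and avoids the locale-dependent date formatting of ls. This is a sketch assuming GNU find (`-printf` is not available in BSD or busybox find); the function name `meta_sum` is made up for illustration.

```shell
#!/bin/sh
# Fingerprint of names, sizes and modification times; file contents are never read.
meta_sum() {
  # %P  = path relative to the starting directory
  # %s  = size in bytes
  # %T@ = modification time in seconds since the epoch
  find "$1" -type f -name "*.py" -printf '%P\t%s\t%T@\n' \
    | LC_ALL=C sort | md5sum | awk '{print $1}'
}
```

Usage would be `meta_sum /path/to/dir`. It is fast, but it inherits the trade-off described above: a changed timestamp alters the result even when the content is identical, and an edit that preserves size and timestamp goes unnoticed.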

jmucchiello
  • 4
    This may fit some use cases, but generally you would want the checksum to reflect only the content and not the dates at all. For example, if I `touch` a file to change its date (but *not* its contents) then I would expect the checksum to be unchanged. – Todd Owen Oct 02 '12 at 06:22
2

Using md5deep:

md5deep -r FOLDER | awk '{print $1}' | sort | md5sum

Benjamin Trent
  • What is it supposed to do? What is the principle of operation? Why does it works? Can you [elaborate](https://stackoverflow.com/posts/24813545/edit) in your answer? (But ***without*** "Edit:", "Update:", or similar - the question/answer should appear as if it was written today.) – Peter Mortensen Nov 06 '21 at 23:44
  • OK, *doesntreallymatter* seems to have left the building (*"Last seen more than 7 years ago"*). Perhaps somebody else can chime in? – Peter Mortensen Nov 06 '21 at 23:46
2

I want to add that if you are trying to do this for files/directories in a Git repository to track if they have changed, then this is the best approach:

git log -1 --format=format:%H --full-diff <file_or_dir_name>

And if it's not a Git directory/repository, then the answer by ire_and_curses is probably the best bet:

tar c <dir_name> | md5sum

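However, please note that the tar command can produce a different hash when it is run on a different OS or with a different tar version, because the archive also records metadata and the order of its members is not guaranteed. If you want to be immune to that, this is the best approach, even though it doesn't look very elegant at first sight: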

find <dir_name> -type f -print0 | sort -z | xargs -0 md5sum | md5sum | awk '{ print $1 }'
Peter Mortensen
rrlamichhane
1

I had the same problem, so I came up with this script. It lists the MD5 hash values of the files in a directory and, when it finds a subdirectory, runs itself again from there. To make that possible, the script works either on the current directory or on the directory passed in as $1.

#!/bin/bash
# Print the MD5 hash of every regular file below a directory.
# Usage: myScript [dir]   (defaults to the current directory)

dir="${1:-.}"

# Loop over the entries of the directory; like ls, the glob skips dotfiles.
for entry in "$dir"/*; do
  if [ -f "$entry" ] ; then
    md5sum "$entry"
  elif [ -d "$entry" ] ; then
    bash "$0" "$entry" # call this script again on the subdirectory
  fi
done
Peter Mortensen
alan
  • 1
    I'm pretty sure that this script will fail if filenames contain spaces or quotes. I find this annoying with bash scripting, but what I do is change the IFS. – localhost Jun 24 '13 at 23:36
1

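If you really want independence from the file system attributes and from the bit-level differences between tar versions, you could use cpio. In copy-out mode (-o) it builds an archive from the list of file names it reads on standard input (-i would extract instead):

find theDirname -print | LC_ALL=C sort | cpio -o | md5sum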
Peter Mortensen
peterh
1

md5sum worked fine for me, but I had issues with sort and sorting file names. So instead I sorted by md5sum result. I also needed to exclude some files in order to create comparable results.

find . -type f -print0 \
  | xargs -r0 md5sum \
  | grep -v ".env" \
  | grep -v "vendor/autoload.php" \
  | grep -v "vendor/composer/" \
  | sort -d \
  | md5sum

MonkeyMonkey
0

There are two more solutions:

Create:

du -csxb /path | md5sum > file

ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum > /tmp/file

Check:

du -csxb /path | md5sum -c file

ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum -c /tmp/file
Nick