I'm comparing a list of files across two repos to flag which ones have changed. The problem is that my code reports every file as different, even though inspecting each hash digest individually shows that many digests are identical.

while IFS= read -r filename;
  do
    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    # inspecting the digest of each file individually         #
    # shows many files are identical and so are the digests   #
    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    md5 old/$filename; # a456cca87913a4788d980ba4c2f254be
    md5 new/$filename; # a456cca87913a4788d980ba4c2f254be
    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    # the below conditional is only supposed to echo "differs"    #
    # if the two digests are different                            #
    # but, instead, it echoes "differs" on every file comparison  #
    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    [[ $(md5 old/$filename) = $(md5 new/$filename) ]] || echo differs; # differs
  done < files-to-compare.txt

How can I fix this bug and only get the files that are different to report?

Edit

Also, note using == instead of = as in

[[ $(md5 old/$filename) == $(md5 new/$filename) ]] || echo differs;

yields exactly the same buggy output.

Edit2

A comment suggests using quotes. That also doesn't work.

[[ "$(md5 old/$filename)" == "$(md5 new/$filename)" ]] || echo differs;
Let Me Tink About It

4 Answers

Instead of computing MD5 checksums, you could use the diff command, which compares file contents. Its primary use is to process files line by line and report their differences (and generate patches), but it can just as easily be used for this purpose. It returns an exit status of 0 if there are no differences between the two files and 1 if there are any differences (a status greater than 1 indicates an error).

while IFS= read -r filename;
  do
    if ! diff "old/$filename" "new/$filename" > /dev/null;
    then
      echo "“$filename” differs"
    fi
  done < files-to-compare.txt

If you’re using GNU diff, you could simply use its -q, --brief option which reports only that the files differ (instead of detailing how they differ):

while IFS= read -r filename;
  do
    diff -q "old/$filename" "new/$filename"
  done < files-to-compare.txt
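
As a comment below points out, `cmp -s` is a POSIX alternative to `diff -q` when all you need is a same-or-different answer. The following is a minimal sketch, assuming the same old/ and new/ layout and list file as in the question; the function name `report_changed` is made up for illustration.

```shell
# report_changed LIST: for each file name listed in LIST, print
# "<name> differs" when old/<name> and new/<name> are not byte-identical.
# (report_changed is a hypothetical helper name, not part of any tool.)
report_changed() {
  while IFS= read -r filename; do
    # cmp -s prints nothing. Exit status 0 means identical, 1 means
    # different, and >1 means an error (for example a missing file),
    # so a missing file is also reported as "differs" here.
    if ! cmp -s "old/$filename" "new/$filename"; then
      printf '%s differs\n' "$filename"
    fi
  done < "$1"
}
```

It would be called as `report_changed files-to-compare.txt` from the directory that contains old/ and new/.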
Anthony Geoghegan
  • `diff` has a `--recursive` option that will compare two subdirectories of files rendering most of the Bash functionality redundant here. – dawg Nov 30 '18 at 14:49
  • @dawg I regularly use the `-r` option but in this case, it looks like the OP already has a list of specific filenames to be checked in `files-to-compare.txt` (it's also not clear that they are using GNU diff). – Anthony Geoghegan Nov 30 '18 at 14:57
  • The recursive option is also on BSD. Don't know if it is POSIX – dawg Nov 30 '18 at 15:05
  • @dawg I try to avoid suggesting the long (double-dash multiple-character) version of options -- unless I know for sure the user is using GNU software. In this case, it's safer to suggest `-r` over `--recursive`. – Anthony Geoghegan Nov 30 '18 at 15:15
  • `diff -r` is indeed [supported by POSIX](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/diff.html). – chepner Nov 30 '18 at 16:05
  • This is a good solution. Comparing MD5 checksums is a long-winded, slow, and possibly unreliable way of checking if two files differ. See [Are there two known strings which have the same MD5 hash value?](https://crypto.stackexchange.com/q/1434). However, `cmp -s` is a better option than `diff -q`. The `-q` option to `diff` is not POSIX, and `diff`'s handling of binary files is "implementation-defined". `cmp -s` is normally optimized for detecting if files are different. See [Fastest way to tell if two files are the same in Unix/Linux?](https://stackoverflow.com/q/12900538/4154375). – pjh Nov 30 '18 at 20:39

Here is your script corrected:

while IFS= read -r filename;
    do
        # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
        # inspecting the digest of each file individually         #
        # shows many files are identical and so are the digests   #
        # It also prints MD5 (full file path) = md5_signature!    #
        # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
        md5 "old/$filename"              # please use double quotes
        md5 "new/$filename" 
        # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
        # Using -q eliminates all output from md5 except the sig      #
        # Your script now works correctly                             #
        # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

        [[ $(md5 -q "old/$filename") == $(md5 -q "new/$filename") ]] || echo differs; # differs
    done < files-to-compare.txt

Problems:

  1. You had a typo of new/$fullfile rather than new/$filename
  2. You should use "new/$filename" (i.e., put double quotes around the file name expansions)
  3. Use md5 -q to compare output of md5 on different files. Otherwise md5, by default, prints the input file path in the form of MD5 (full_path/base_name) = 2504fcc0c0a57d14aa6b4193b5efaf94. Since these paths are guaranteed to be different in two different directories, the different path names will cause the failure in the string comparison.

The comments above assume you are using md5 on BSD or, likely, on macOS.

Here is an alternate solution that works both on Linux with md5sum and BSD with md5. Just feed the content of the file to the stdin of either program and only the md5 signature is printed:

$ md5 <new/file.pdf
2504fcc0c0a57d14aa6b4193b5efaf94

By contrast, if you pass the file name as an argument, the path is printed along with the MD5 signature:

$ md5 new/file.pdf
MD5 (new/file.pdf) = 2504fcc0c0a57d14aa6b4193b5efaf94

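The same holds true for md5sum on Linux (GNU core utilities), except that when it reads from stdin it prints the digest followed by a - in place of the file name. Because that suffix is identical for both files, the comparison still works.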
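
To make the stdin approach concrete, here is a minimal sketch that uses whichever of the two tools is installed and keeps only the digest. The helper name `digest_of` is hypothetical, and the sketch assumes that at least one of md5sum or md5 is on the PATH.

```shell
# digest_of FILE: print only the MD5 digest of FILE.
# (digest_of is a hypothetical helper; it assumes md5sum or md5 exists.)
digest_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum < "$1"   # GNU: prints "<digest>  -" when reading stdin
  else
    md5 < "$1"      # BSD/macOS: prints just "<digest>" when reading stdin
  fi | cut -d ' ' -f 1   # keep the first space-separated field only
}
```

Inside the loop, the comparison could then be written as `[[ $(digest_of "old/$filename") == $(digest_of "new/$filename") ]] || echo "$filename differs"`.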

dawg
  • I thought using input redirection was a good idea to print only the MD5 digest (without the file name) but when you use input redirection with *md5sum (GNU coreutils) 8.25*, it prints the digest and then a `-` for the file name. :( – Anthony Geoghegan Dec 03 '18 at 02:20
  • Since you will consistently have a space and a dash: `' -'` they will compare the same. So only the signature is significant in that case... – dawg Dec 03 '18 at 04:36

On my Ubuntu Linux system, there is the md5sum command, which prints the digest followed by the file name:

md5sum myFile
215e0f7b4ea9fd9ea5f31106155839fe  myFile

So you need to extract only the digest from the output:

md5sum myFile | sed 's/^\([^[:blank:]]*\).*$/\1/g'
215e0f7b4ea9fd9ea5f31106155839fe

Then use this last command line in the test:

...
[[ $(md5sum old/"${filename}" | sed 's/^\([^[:blank:]]*\).*$/\1/g') = $(md5sum new/"${filename}" | sed 's/^\([^[:blank:]]*\).*$/\1/g') ]] || echo differs;
...
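
The sed pipeline is one way to drop the file name. As a sketch of an alternative that avoids the extra process, a shell parameter expansion can cut the line at the first space. The helper name `digest_field` is hypothetical, and the sample line is the md5sum output shown above.

```shell
# digest_field LINE: print the part of LINE before its first space.
# (digest_field is a hypothetical helper name.)
digest_field() {
  # ${1%% *} removes the longest suffix that starts with a space,
  # which leaves only the first space-separated word.
  printf '%s\n' "${1%% *}"
}

digest_field '215e0f7b4ea9fd9ea5f31106155839fe  myFile'   # prints 215e0f7b4ea9fd9ea5f31106155839fe
```

In the test it would be used as `digest_field "$(md5sum old/"${filename}")"` on each side of the comparison.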
Jay jargot
  • `md5sum` is Linux only. BSD and macOS use `md5`. [See this](https://stackoverflow.com/questions/1299833/bsd-md5-vs-gnu-md5sum-output-format) – dawg Nov 30 '18 at 15:50
  • ha ok, did not notice, thank you, I am editing the answer to include your info. – Jay jargot Nov 30 '18 at 15:55
  • @dawg: oops, sorry, you already answered correctly. Removing my edit, leaving only the Linux part. – Jay jargot Nov 30 '18 at 16:00
  • You can also read from standard input (`md5 < old/"$filename"`), which I was going to post until I saw this. (I was unaware of the `-q` option.) – chepner Nov 30 '18 at 16:00
  • It's unclear if the OP is getting an error because they *only* have `md5sum` and not `md5`, or if the problem is simply with the filename being included in the output of `md5`. I think your (deleted) edit was better than the current answer. (`md5sum` can also read from standard input, without needing to resort to the `sed` pipeline.) – chepner Nov 30 '18 at 16:02
  • @chepner: true (removed my extra edit to leave the most probable correct answer to @dawg: `md5 -q`) – Jay jargot Nov 30 '18 at 16:03
  • If the OP really is using `md5`, then bringing up `md5sum` really isn't relevant. – chepner Nov 30 '18 at 16:03
  • @chepner: bah! it is written, I left it like that for History – Jay jargot Nov 30 '18 at 16:04

To see only the lines that differ between two files, you can use grep. The command below prints the lines of filename2 that do not appear anywhere in filename1.

grep -v -F -x -f filename1 filename2

comm can also be used for this purpose. With -13 it prints only the lines that are unique to the second file (both inputs must be sorted):

comm -13 <(sort filename1) <(sort filename2)
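
Both commands above are one-directional: they report lines found only in the second file. A minimal sketch of checking each direction in turn, using the same grep options; the function name `lines_only_in` is hypothetical.

```shell
# lines_only_in A B: print the lines of A that do not occur as a
# whole line anywhere in B. (lines_only_in is a hypothetical helper name.)
lines_only_in() {
  # -F: fixed strings, -x: match whole lines, -f: read patterns from B,
  # -v: invert, so only lines of A with no match in B are printed.
  # grep exits with status 1 when it prints nothing.
  grep -v -F -x -f "$2" "$1"
}
```

Running `lines_only_in filename1 filename2` and then `lines_only_in filename2 filename1` covers both directions. As the comments note, neither call says anything about line order, so diff or cmp remains the better fit for deciding whether two files are identical.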

wuseman
  • Are you sure that `diff` made exactly for the purpose of comparing two text files wouldn't be better? – liborm Nov 30 '18 at 13:09
  • Hello, so is `comm`: 'comm (1) - compare two sorted files line by line' – wuseman Nov 30 '18 at 13:15
  • As a random aside you could also use `cmp`, but it has no real advantage over `diff` for text files, and will only tell you that they differ and where. – Paul Hodges Nov 30 '18 at 14:48
  • Also, `grep -v -F -x -f filename1 filename2` would tell you about any lines in 2 that are not in 1, but what about lines in 1 that are not in 2? Likewise, it says nothing of ordering. Better to use `diff`. – Paul Hodges Nov 30 '18 at 14:50