349

I have a shell script in which I need to check whether two files contain the same data or not. I do this for a lot of files, and in my script the diff command seems to be the performance bottleneck.

Here's the line:

diff -q $dst $new > /dev/null

if ($status) then ...

Could there be a faster way to compare the files, maybe a custom algorithm instead of the default diff?

Benjamin Loison
JDS
  • 16
    This is really nitpicking, but you're not asking to see if two files are the same, you're asking if two files have identical content. Same files have identical inodes (and same device). – Zano Nov 04 '14 at 09:08
  • 2
    Unlike the accepted answer, the measurement in [this answer](https://unix.stackexchange.com/a/153612/87704) does not recognize any notable difference between `diff` and `cmp`. – wedi May 04 '18 at 09:07
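The distinction raised in the first comment (same file versus identical content) can be sketched in shell. This is a minimal illustration, not something taken from the thread: `classify_pair` is a hypothetical helper name, and the `-ef` test is assumed to be available (it is in bash and most modern `sh` implementations, but it is not guaranteed everywhere).

```shell
# Sketch: distinguish "same file" (same device and inode) from "same content".
# classify_pair is a hypothetical helper name used only for illustration.
classify_pair() {
  if [ "$1" -ef "$2" ]; then
    echo "same file"             # both names refer to the same inode
  elif cmp -s -- "$1" "$2"; then
    echo "identical content"     # separate files holding the same bytes
  else
    echo "different content"     # bytes differ, or cmp could not read a file
  fi
}

# Demo on throwaway files
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a"
cp "$tmp/a" "$tmp/copy"          # separate file, same bytes
ln "$tmp/a" "$tmp/hardlink"      # second name for the same inode
printf 'bye\n' > "$tmp/b"
classify_pair "$tmp/a" "$tmp/hardlink"
classify_pair "$tmp/a" "$tmp/copy"
classify_pair "$tmp/a" "$tmp/b"
rm -rf "$tmp"
```

Checking `-ef` first avoids reading any data when both paths point at the same file.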

9 Answers

546

I believe cmp will stop at the first byte difference:

cmp --silent $old $new || echo "files are different"
Benjamin Loison
Alex Howansky
  • 2
    How can I add more commands than only one? I want to copy a file and reboot. – feedc0de Jun 14 '14 at 15:09
  • @DanielBrunner: You can copy from the standard input to both a file and standard output by using the `tee` command. – Anders Rabo Thorbeck Jun 18 '14 at 06:51
  • 1
    Note that on my `cmp` I didn't have to shortcut it to echo, it will print a message if they differ or stay silent if they don't. – eresonance May 11 '15 at 17:44
  • @eresonance Right, the example is simply meant to show how you'd capture the return status in order to script a conditional. – Alex Howansky May 11 '15 at 17:56
  • 29
    `cmp -s $old $new` also works. `-s` is short for `--silent` – tim-phillips Mar 05 '16 at 01:09
  • 9
    As a speed boost, you should check the file sizes are equal before comparing the content. Does anyone know if cmp does this? – BeowulfNode42 Oct 03 '16 at 09:09
  • 5
    To run multiple commands, you can use brackets: cmp -s old new || { echo not; echo the; echo same; } – unfa Mar 15 '17 at 09:29
  • 16
    @BeowulfNode42 yes, any decent implementation of `cmp` will check file size first. Here's the GNU version, if you want to see the additional optimizations it includes: http://git.savannah.gnu.org/cgit/diffutils.git/tree/src/cmp.c – Ryann Graham Apr 06 '18 at 02:00
  • 1
    @RyanGraham thanks for the link. I see that if the -s or --silent switch is used then it will use the size check to immediately exit if the files are different sizes. I see it also has a few other optimisations like 0 sized files, or files with the same inode (ie both files are links to the same file). – BeowulfNode42 Apr 06 '18 at 08:07
  • I think the fastest (to type) is `cmp -l $old $new`, and you get no output for same and a lot for different files, lol – Man May 09 '18 at 04:56
  • 2
    What @Rohmer mentioned (the `-s` option) is also portable, `--silent` is not defined in POSIX standard. – jimmymcheung Mar 09 '23 at 13:03
80

Like @Alex Howansky, I have used 'cmp --silent' for this. But I need both a positive and a negative response, so I use:

cmp --silent file1 file2 && echo '### SUCCESS: Files Are Identical! ###' || echo '### WARNING: Files Are Different! ###'

I can then run this in the terminal or over ssh to check files against a constant file.

Benjamin Loison
pn1 dude
  • 23
    If your `echo success` command (or whatever other command you put in its place) fails, your "negative response" command will be run. You should use an "if-then-else-fi" construct. For example, like [this simple example](http://stackoverflow.com/a/16034851/5419599). – Wildcard Jan 06 '16 at 00:10
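The pitfall described in this comment can be reproduced with a small sketch. The function names (`on_same`, `on_diff`, `compare_chain`, `compare_if`) are hypothetical and exist only for illustration; `on_same` deliberately returns a failure status to stand in for a "success" command that goes wrong.

```shell
# Sketch of the pitfall: in "A && B || C", C also runs when B fails.
on_same() { echo "identical"; return 1; }   # success handler that itself fails
on_diff() { echo "different"; }

# Chained form: on_diff runs if cmp fails OR if on_same fails.
compare_chain() { cmp -s -- "$1" "$2" && on_same || on_diff; }

# if/else form: exactly one branch runs, decided only by cmp.
compare_if() {
  if cmp -s -- "$1" "$2"; then
    on_same
  else
    on_diff
  fi
}

# Demo with two identical throwaway files
tmp=$(mktemp -d)
printf 'data\n' > "$tmp/a"
cp "$tmp/a" "$tmp/b"
compare_chain "$tmp/a" "$tmp/b"        # prints "identical" and then "different"
compare_if "$tmp/a" "$tmp/b" || true   # prints only "identical"
rm -rf "$tmp"
```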
44

To quickly and safely compare any two files:

if cmp --silent -- "$FILE1" "$FILE2"; then
  echo "files contents are identical"
else
  echo "files differ"
fi

It's readable, efficient, and works for any file names including "` $()

VasiliNovikov
21

Because I suck and don't have enough reputation points I can't add this tidbit in as a comment.

But, if you are going to use the cmp command (and don't need/want to be verbose) you can just grab the exit status. Per the cmp man page:

If a FILE is '-' or missing, read standard input. Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.

So, you could do something like:

STATUS="$(cmp --silent -- "$FILE1" "$FILE2"; echo $?)"  # "$?" gives exit status for each comparison

if [[ $STATUS -ne 0 ]]; then  # if status isn't equal to 0, then execute code
    DO A COMMAND ON $FILE1
else
    DO SOMETHING ELSE
fi

EDIT: Thanks for the comments everyone! I updated the test syntax here. However, I would suggest you use Vasili's answer if you are looking for something similar to this answer in readability, style, and syntax.

Benjamin Loison
Gregory Martin
  • yes, but this is actually a more complicated way of doing `cmp --silent $FILE1 $FILE2 ; if [ "$?" == "1" ]; then echo "files differ"; fi` which in turn is a more complicated way of doing `cmp --silent $FILE1 $FILE2 || echo "files differ"` because you can use a command in the expression directly. It substitutes for `$?`. As a result, the command's exit status will be compared. And that's what the other answer does. btw, if someone is struggling with `--silent`, it's not supported everywhere (busybox); use `-s` – papo Feb 13 '20 at 18:03
  • 3
    This can be simplified to just `if cmp --silent -- "$FILE1" "$FILE2"; then ... else ... fi` – VasiliNovikov Jul 19 '20 at 20:04
  • as @VasiliNovikov pointed out, you can just do `if command; then ... else ... fi` also, @Gregory your code has a common bash pitfall. `[[` is in fact a bash syntax and it should go as follows: `if [[ ... ]]` (notice the spaces) A very good URL to read up on common bash pitfalls: https://mywiki.wooledge.org/BashPitfalls – Chevraut Sep 18 '20 at 17:41
  • @Chevraut after re-reading this QA and noticing that all current suggestions are not fully safe, I've created my own answer (basically same as I wrote here in comment) – VasiliNovikov Sep 20 '20 at 06:47
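Since the man page quoted in this answer lists three exit statuses (0, 1 and 2), a script may want to treat "trouble" separately from "different". The following is a minimal sketch under that assumption; `compare_status` is a hypothetical name, and the portable `-s` flag is used in place of `--silent`.

```shell
# Sketch: map cmp's three exit statuses to distinct outcomes.
# compare_status is a hypothetical helper name used only for illustration.
compare_status() {
  rc=0
  cmp -s -- "$1" "$2" || rc=$?   # keep the status without tripping "set -e"
  case $rc in
    0) echo "identical" ;;
    1) echo "different" ;;
    *) echo "error" ;;           # 2 (or anything else): e.g. unreadable file
  esac
}
```

Calling `compare_status "$FILE1" "$FILE2"` prints exactly one of the three words, so a missing file is not silently reported as "different".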
6

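You can compare using a checksum algorithm such as SHA-256:

# store only the hash (first field), not the "hash  filename" pair
sha256sum oldFile | awk '{print $1}' > oldFile.sha256

# rebuild a "hash  filename" line for newFile and verify it
echo "$(cat oldFile.sha256)  newFile" | sha256sum --check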

newFile: OK

If the files differ, the result will be:

newFile: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
Benjamin Loison
rafapc2
3

For files that are not different, any method will require having read both files entirely, even if the read was in the past.

There is no alternative. So creating hashes or checksums at some point in time requires reading the whole file. Big files take time.

File metadata retrieval is much faster than reading a large file.

So, is there any file metadata you can use to establish that the files are different? File size, or even the output of the file command, which reads only a small portion of the file?

File size example code fragment:

  ls -l $1 $2 | 
  awk 'NR==1{a=$5} NR==2{b=$5} 
       END{val=(a==b)?0 :1; exit( val) }'
       
[ $? -eq 0 ] && echo 'same' || echo 'different'  

If the files are the same size then you are stuck with full file reads.
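The size-then-content idea can also be sketched without parsing `ls` output. This is only an illustration: `same_content` is a hypothetical name, and `wc -c` is assumed to be a cheap way to get the size of a regular file. According to the comments on the accepted answer, GNU `cmp -s` already performs a size check itself, so the explicit check here mainly makes the logic visible.

```shell
# Sketch: compare sizes first, and read file contents only when sizes match.
# same_content is a hypothetical helper name used only for illustration.
# Returns 0 if identical, 1 if different, 2 if a file could not be read.
same_content() {
  size1=$(wc -c < "$1") || return 2
  size2=$(wc -c < "$2") || return 2
  [ "$size1" -eq "$size2" ] || return 1   # different sizes: cannot be equal
  cmp -s -- "$1" "$2"                     # same size: full comparison needed
}
```

It can be used directly in a condition, for example `if same_content "$1" "$2"; then echo 'same'; else echo 'different'; fi`.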

Benjamin Loison
jim mcnamara
1

Try also to use the cksum command:

# file1 and file2 hold the paths of the files to compare
chk1=$(cksum "$file1" | awk '{print $1}')  # first field is the CRC checksum
chk2=$(cksum "$file2" | awk '{print $1}')

if [ "$chk1" -eq "$chk2" ]
then
  echo "Files are identical"
else
  echo "Files are not identical"
fi

The cksum command outputs a CRC checksum followed by the byte count of the file; the script above compares the checksums. See 'man cksum'.

Benjamin Loison
Nono Taps
  • 3
    That was my first thought too. However, hashes make sense if you have to compare the same file many times, as the hash is computed only once. If you're comparing it only once, then `md5` reads the whole file anyway, so `cmp`, stopping at the first difference, will be way faster. – Francesco Dondi Sep 06 '17 at 14:13
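The case this comment describes, where one reference file is compared against many candidates, might look like the sketch below. It assumes `sha256sum` from GNU coreutils is installed (other systems may ship a different tool), and `hash_of` and `matches_reference` are hypothetical helper names.

```shell
# Sketch: hash the reference file once, then reuse that hash for every candidate.
# hash_of and matches_reference are hypothetical names used only for illustration.
hash_of() {
  # Reading from stdin keeps the file name out of the output; awk keeps the hash.
  sha256sum < "$1" | awk '{print $1}'
}

matches_reference() {   # $1 = precomputed reference hash, $2 = candidate file
  [ "$(hash_of "$2")" = "$1" ]
}

# Usage (paths are placeholders):
#   ref_hash=$(hash_of reference.bin)      # computed a single time
#   for f in candidates/*; do
#     matches_reference "$ref_hash" "$f" && echo "$f matches"
#   done
```

Each candidate is still read in full, so for a one-off comparison of two files `cmp -s` remains the cheaper choice, as the comment notes.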
1

Doing some testing with a Raspberry Pi 3B+ (I'm using an overlay file system, and need to sync periodically), I ran a comparison of my own for diff -q and cmp -s; note that this is a log from inside /dev/shm, so disk access speeds are a non-issue:

[root@mypi shm]# dd if=/dev/urandom of=test.file bs=1M count=100 ; time diff -q test.file test.copy && echo diff true || echo diff false ; time cmp -s test.file test.copy && echo cmp true || echo cmp false ; cp -a test.file test.copy ; time diff -q test.file test.copy && echo diff true || echo diff false; time cmp -s test.file test.copy && echo cmp true || echo cmp false
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.2564 s, 16.8 MB/s
Files test.file and test.copy differ

real    0m0.008s
user    0m0.008s
sys     0m0.000s
diff false

real    0m0.009s
user    0m0.007s
sys     0m0.001s
cmp false
cp: overwrite ‘test.copy’? y

real    0m0.966s
user    0m0.447s
sys     0m0.518s
diff true

real    0m0.785s
user    0m0.211s
sys     0m0.573s
cmp true
[root@mypi shm]# pico /root/rwbscripts/utils/squish.sh

I ran it a couple of times. cmp -s consistently had slightly shorter times on the test box I was using. So if you want to use cmp -s to do things between two files....

identical (){
  echo "$1" and "$2" are the same.
  echo This is a function, you can put whatever you want in here.
}
different () {
  echo "$1" and "$2" are different.
  echo This is a function, you can put whatever you want in here, too.
}
cmp -s "$FILEA" "$FILEB" && identical "$FILEA" "$FILEB" || different "$FILEA" "$FILEB"
Benjamin Loison
Jack Simth
0

If you are looking for a more customizable diff for this, then git diff can be used.

if (git diff --no-index --quiet old.txt new.txt) then
  echo "files contents are identical"
else
  echo "files differ"
fi

--quiet

Disable all output of the program. Implies --exit-code.

--exit-code

Make the program exit with codes similar to diff(1). That is, it exits with 1 if there were differences and 0 means no differences.


Also, there are various algorithms and settings to choose from: [ref]

--diff-algorithm={patience|minimal|histogram|myers}

Choose a diff algorithm. The variants are as follows:

default, myers The basic greedy diff algorithm. Currently, this is the default.

minimal Spend extra time to make sure the smallest possible diff is produced.

patience Use "patience diff" algorithm when generating patches.

histogram This algorithm extends the patience algorithm to "support low-occurrence common elements".

Benjamin Loison
the Hutt