82

I am working on a Fedora Constantine box and want to diff two directories recursively to check for source changes. Due to the setup of the project (prior to my own engagement with said project! sigh), the directories contain both source and binaries, as well as large binary datasets. Diffing these directories does eventually finish, but it would take perhaps twenty seconds if I could ignore the binary files.

As far as I understand, diff does not have an 'ignore binary files' mode, but it does have an ignore argument that ignores lines matching a regular expression within a file. I don't know what to write there to ignore binary files, regardless of extension.

I'm using the following command, but it does not ignore binary files. Does anyone know how to modify this command to do this?

diff -rq dir1 dir2
codeforester
Zéychin
  • 2
    Try using `cmp` instead of `diff`, will not ignore binary files, but should be faster – Fredrik Pihl Jul 15 '11 at 19:16
  • 2
    eek. this is the poster-child justification for source control. if you're not using it, you should be. if the decision isn't in your hands, you should argue passionately. your problem would disappear with a proper git setup... – fearlesstost Jul 15 '11 at 19:56
  • 6
    Oh believe me. I know. I'm doing undergraduate research and this isn't quite set up the way it should be. Believe me. I KNOW. CVS/SVN/GIT would fix this. Know what's worse than that? I was assigned to work on a Fortran project with little to no documentation. There are 8 versions of the project in this directory and each one has different makefiles that (almost ;)) do the same thing. Believe you me, I am arguing with my overseer as well as I can. – Zéychin Jul 15 '11 at 20:03
  • @FredrikPihl I [don't think](https://www.gnu.org/software/diffutils/manual/html_node/cmp-Options.html) cmp supports directories. Let alone recursively. Did it support directories 10 yrs ago? – Darren Ng Aug 13 '21 at 08:55

6 Answers

68

Kind of cheating but here's what I used:

diff -r dir1/ dir2/ | sed '/Binary\ files\ /d' >outputfile

This recursively compares dir1 to dir2; sed deletes the lines that report binary files (they begin with "Binary files "), and the remaining output is redirected to outputfile.

Shannon VanWagner
  • 681
  • 1
  • 5
  • 3
  • 7
    @Serg You can exclude files using the `-x` flag. Try `diff -r -x '*.xml' dir1 dir2` Also, `man diff` for more info. – xdhmoore Apr 03 '13 at 20:17
  • 1
    If you are on a system with a different language, replace `Binary\ files\ ` with the appropriate words in your language. It should be the first one or two words. In German it's `Binärdateien\ ` – kap Apr 13 '17 at 12:44
  • 1
    @xdhmoore Thanks for the comment! To add to it, `-x` is also repeatable, for if you want to exclude _multiple_ patterns. Something like `-x '*.ext1' -x '*.ext2' -x 'ext3'`. – Vasan Jun 06 '18 at 17:54
  • Any benefit of using `sed` over just `grep -v 'Binary files'`? – bluenote10 Aug 12 '21 at 08:53
  • @bluenote10 yes, I think that `grep -v` is definitely more appropriate, for this use case. – Pierre Jun 16 '23 at 09:18
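
Pulling the comments above together, here is a minimal sketch that is not part of the original answer: `grep -v` replaces `sed`, the pattern is anchored to the start of the line, and `LC_ALL=C` forces untranslated messages so the English pattern matches whatever the locale. The function name `diff_text_only` is made up for illustration, and the "Binary files ... differ" wording assumes GNU diff.

```shell
#!/bin/bash
# Sketch: recursive diff with the "Binary files ... differ" notices dropped.
# diff_text_only is a hypothetical helper name, not a standard tool.
diff_text_only() {
    # LC_ALL=C makes diff print untranslated (English) messages,
    # so the grep pattern below works on any locale.
    LC_ALL=C diff -r "$1" "$2" | grep -v '^Binary files '
}

# Example call (dir1, dir2 and outputfile as in the answer above):
#   diff_text_only dir1 dir2 > outputfile
```

Note that this only filters the report. diff still reads and compares the binary files, so the run is no faster.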
33

Maybe use grep -I (which is equivalent to grep --binary-files=without-match) as a filter to weed out the binary files.

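dir1='folder-1'
dir2='folder-2'
# grep -I skips binary files; -l prints the name of each file with a match
grep -Ilsr -m 1 '.' "$dir1" | while IFS= read -r file; do
   # Swap the leading dir1 prefix for dir2 to get the counterpart path
   diff -q "$file" "$dir2${file#"$dir1"}"
done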
jon
14

I came to this (old) question looking for something similar (Config files on a legacy production server compared to default apache installation). Following @fearlesstost's suggestion in the comments, git is sufficiently lightweight and fast that it's probably more straightforward than any of the above suggestions. Copy version1 to a new directory. Then do:

git init
git add .
git commit -m 'Version 1'

Now delete all the files from version 1 in this directory (but leave the .git directory in place) and copy version 2 into it. Then do:

git add -A
git commit -m 'Version 2'
git show

This will show you Git's version of all the differences between the first commit and the second. For binary files it will just say that they differ. Alternatively, you could create a branch for each version and try to merge them using git's merge tools.
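
If the goal is only to see Git's view of the differences, a shorter route that is not part of the answer above may be `git diff --no-index`, which compares two paths on disk without creating a repository or any commits. A minimal sketch, assuming git is installed; `compare_with_git` is a hypothetical name used only for illustration.

```shell
#!/bin/bash
# Sketch: let git compare two directory trees directly.
# compare_with_git is a hypothetical helper name.
compare_with_git() {
    # --no-index compares two filesystem paths outside of any repository.
    # Binary files are reported with a single "Binary files ... differ" line.
    git diff --no-index -- "$1" "$2"
}

# Example call:
#   compare_with_git version1 version2
```

As with plain diff, the command exits with status 1 when the trees differ, so a non-zero status is not necessarily an error.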

Lucas Wiman
7

If the names of the binary files in your project follow a specific pattern (*.o, *.so, ...) as they usually do, you can put those patterns in a file and specify it using -X (hyphen X).

Contents of my exclude_file:

*.o
*.so
*.git

Command:

diff -X exclude_file -r . other_tree > my_diff_file

UPDATE:

-x can be used instead of -X, to specify exclusion patterns on the command line rather than in a file:

diff -r -x '*.o' -x '*.so' -x '*.git' dir1 dir2
simlev
Mohan S Nayaka
  • 1
    It is `-x`, NOT `-X`. – dpaks May 12 '17 at 05:50
  • 2
    @code_dweller Both exist: `-x` is for excluding a pattern on the command line, while `-X` indicates the file containing all the patterns to be excluded. – simlev Sep 18 '19 at 15:25
  • The last command given in the answer should have quoting around the stars, otherwise the shell will expand them (prior to calling `diff`) according to files present **in the current directory**. Thus, the command should read `diff -rx '*.o' -x '*.so' -x '*.git' dir1 dir2`. – frougon Sep 16 '21 at 15:32
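
When the binary files do not follow a naming pattern, the exclude file could be generated instead of written by hand. The following is a sketch under assumptions, not part of the answer above: it relies on GNU grep, where `-I` treats binary files as non-matching and `-L` lists files without a match. `make_binary_excludes` is a hypothetical name.

```shell
#!/bin/bash
# Sketch: build an exclude file from the binary files actually present.
# make_binary_excludes is a hypothetical helper name.
make_binary_excludes() {
    # grep -rIL '.' prints files with no matching line: binary files
    # (skipped because of -I) and empty files.
    # Only the base name is kept, because diff -X matches base names.
    grep -rIL '.' "$@" | sed 's|.*/||' | sort -u
}

# Example call:
#   make_binary_excludes dir1 dir2 > exclude_file
#   diff -r -X exclude_file dir1 dir2
```

Two caveats: a text file that shares its base name with a binary file elsewhere in the tree is excluded as well, and names containing glob characters are interpreted as patterns by diff.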
0

Well, as a crude sort of check, you could ignore files that match /\0/.
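
One way this check might look in practice is sketched below; it is an illustration, not the answerer's code. It inspects only the first 8192 bytes of each file (an arbitrary cut-off chosen here so that large datasets are not read in full) and counts whether deleting NUL bytes shrinks that chunk. The names `has_nul` and `diff_without_nul_files` are hypothetical.

```shell
#!/bin/bash
# Sketch: treat a file as binary if its first 8192 bytes contain a NUL byte.
# has_nul and diff_without_nul_files are hypothetical helper names.
has_nul() {
    local total kept
    total=$(head -c 8192 "$1" | wc -c)
    # tr -d '\000' deletes NUL bytes, so a smaller count means one was present
    kept=$(head -c 8192 "$1" | tr -d '\000' | wc -c)
    [ "$kept" -ne "$total" ]
}

diff_without_nul_files() {
    local dir1=$1 dir2=$2 rel
    # Walk dir1 and compare only the files that pass the NUL check
    (cd "$dir1" && find . -type f) | while IFS= read -r rel; do
        has_nul "$dir1/$rel" || diff -q "$dir1/$rel" "$dir2/$rel"
    done
}

# Example call:
#   diff_without_nul_files dir1 dir2
```

Files that exist only in the second directory are not reported by this loop.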

Troy
0

Use a combination of find and the file command. This requires you to do some research on the output of the file command in your directory; below I'm assuming that the files you want to diff are reported as ASCII. Alternatively, use grep -v to filter out the binary files.

#!/bin/bash

dir1=/path/to/first/folder
dir2=/path/to/second/folder

cd "$dir1" || exit 1

# List regular files, keep those that file reports as ASCII, and diff each one
find . -type f -exec file {} + | grep ASCII | cut -d: -f1 |
while IFS= read -r i; do
    echo "diffing $i ---- $dir2/$i"
    diff -q "$i" "$dir2/$i"
done
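
To avoid depending on the exact wording of the descriptions, one option is to ask `file` for the encoding only. This is a sketch, assuming a `file` implementation that supports `--mime-encoding` and reports `binary` for non-text data; `is_text` is a hypothetical name.

```shell
#!/bin/bash
# Sketch: classify a file by the encoding that file reports.
# is_text is a hypothetical helper name.
is_text() {
    # -b drops the leading "filename:" prefix; text files yield an
    # encoding such as us-ascii or utf-8, other data yields "binary".
    [ "$(file -b --mime-encoding "$1")" != binary ]
}

# Example use inside a loop over file names:
#   is_text "$i" && diff -q "$i" "$dir2/$i"
```

Empty files are typically reported as `binary` as well, so this check skips them.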

Since you probably know the names of the huge binaries, place them in an associative array and only run diff when a file is not in it, something like this:

#!/bin/bash

dir1=/path/to/first/directory
dir2=/path/to/second/directory

content_dir1=$(mktemp)
content_dir2=$(mktemp)

# Sort both listings so that diff compares them in a stable order
(cd "$dir1" && find . -type f -print | sort > "$content_dir1")
(cd "$dir2" && find . -type f -print | sort > "$content_dir2")

echo Files that only exist in one of the paths
echo -----------------------------------------
diff "$content_dir1" "$content_dir2"

# Files to ignore (associative arrays need bash 4 or later)
declare -A F2I
F2I=( [sqlite3]=1 [binfile2]=1 )

while IFS= read -r f; do
    b=$(basename "$f")
    if ! [[ ${F2I[$b]} ]]; then
        diff "$dir1/$f" "$dir2/$f"
    fi
done < "$content_dir1"

# Clean up the temporary listings
rm -f "$content_dir1" "$content_dir2"
Fredrik Pihl