15

I have got 2 files. Let us call them md5s1.txt and md5s2.txt. Both contain the output of a

find -type f -print0 | xargs -0 md5sum | sort > md5s.txt

command in different directories. Many files were renamed, but the content stayed the same. Hence, they should have the same md5sum. I want to generate a diff like

diff md5s1.txt md5s2.txt

but it should compare only the first 32 characters of each line, i.e. only the md5sum, not the filename. Lines with equal md5sum should be considered equal. The output should be in normal diff format.

3 Answers

17

Easy starter:

diff <(cut -d' ' -f1 md5s1.txt)  <(cut -d' ' -f1 md5s2.txt)

Also, consider just

diff -EwburqN folder1/ folder2/
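For a shell without process substitution (plain sh, dash), the first command above can probably be done with temporary files instead. This is only a sketch on made-up sample data: the four-character "hashes", the paths, and the names sums1.txt and sums2.txt are placeholders, not real md5sum output.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample lists. Real md5sum lines have 32 hex characters,
# then two spaces, then the path; the hashes are shortened here.
printf '%s\n' 'aaaa  ./old/a.txt' 'bbbb  ./old/b.txt' 'cccc  ./old/c.txt' > md5s1.txt
printf '%s\n' 'aaaa  ./new/renamed.txt' 'cccc  ./new/c.txt' 'dddd  ./new/d.txt' > md5s2.txt

# Keep only the checksum column of each list.
cut -d' ' -f1 md5s1.txt > sums1.txt
cut -d' ' -f1 md5s2.txt > sums2.txt

# diff exits with status 1 when the inputs differ, which is expected here;
# any other non-zero status is still treated as a failure.
result=$(diff sums1.txt sums2.txt) || [ "$?" -eq 1 ]
printf '%s\n' "$result"
# prints:
# 2d1
# < bbbb
# 3a3
# > dddd
```

The renamed file (aaaa) does not show up, because only the checksum column is compared.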
sehe
  • Extending this answer, if you really want *n* characters, something like: `diff <(cut -b-80 dump.csv) <(cut -b-80 dump2.csv)` (here, `n`=80) – Nick T Aug 30 '17 at 16:12
  • quick fwiw: extending the above (6 year old) comment, if you just want to check the md5, since it's 32 hex characters (128 bits), the actual `cut` would be (specified as characters) `diff <( cut -c-32 f1.txt | sort) <(cut -c-32 f2.txt | sort )`, which could also be written as `cut -b-32` or `cut -c1-32` etc (but using `cut -d' ' -f1` is convenient in that you don't have to count characters). Also, fwiw #2, all those `diff` options won't necessarily always be available (eg on macOS, no `-E`), but that `diff` doesn't solve the OP problem anyway. Last fwiw #3: I actually use `fdupes` for the OP orig problem. – michael Jun 21 '23 at 08:52
3

Compare only the md5 column by running diff on <(cut -c -32 md5sums.sort.XXX), and tell (GNU) diff to print just the line numbers of added or removed lines with --new-line-format='%dn'$'\n' or --old-line-format='%dn'$'\n'. Pipe those line numbers into ed md5sums.sort.XXX, which then prints the corresponding full lines of the md5sums.sort.XXX file.

# Lines only in the new list (ed -s suppresses ed's byte-count output).
diff \
    --new-line-format='%dn'$'\n' \
    --old-line-format='' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.new \
    > files-added
# Lines only in the old list.
diff \
    --new-line-format='' \
    --old-line-format='%dn'$'\n' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.old \
    > files-removed

The drawback of ed is that it loads the entire file into memory, which can be a problem if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following loop, which reads the file sequentially and uses much less memory.

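diff … | (
    lnum=0
    # Each line read from diff is the number of a line to print.
    while read -r lprint; do
        # Advance through the file on descriptor 3 up to that line.
        while [ "$lnum" -lt "$lprint" ]; do IFS= read -r line <&3; lnum=$((lnum + 1)); done
        # Quote the variable so the two spaces after the checksum survive.
        printf '%s\n' "$line"
    done
) 3<md5sums.sort.XXX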
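If you do not need diff's line-based matching, a different technique might be enough: a two-pass awk lookup on the checksum column. This is a sketch, not the method above, and the sample data is invented (shortened placeholder hashes, hypothetical paths). Note the differences: awk keeps every checksum of the first file in memory, treats checksums as a set (repeated checksums are not counted the way diff counts lines), and the `NR == FNR` idiom assumes the first file is not empty.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample lists in md5sum layout (hash, two spaces, path).
printf '%s\n' 'aaaa  ./old/a.txt' 'bbbb  ./old/b.txt' 'cccc  ./old/c.txt' > md5sums.sort.old
printf '%s\n' 'aaaa  ./new/renamed.txt' 'cccc  ./new/c.txt' 'dddd  ./new/my file.txt' > md5sums.sort.new

# First pass (NR == FNR): remember every checksum of the first file.
# Second pass: print the unmodified lines of the second file whose
# checksum was not seen in the first.
awk 'NR == FNR { seen[$1]; next } !($1 in seen)' md5sums.sort.old md5sums.sort.new > files-added
awk 'NR == FNR { seen[$1]; next } !($1 in seen)' md5sums.sort.new md5sums.sort.old > files-removed

cat files-added     # prints: dddd  ./new/my file.txt
cat files-removed   # prints: bbbb  ./old/b.txt
```

Because awk prints the whole input line untouched, file names containing spaces come through intact.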
Suzanne Soy
1

If you are looking for duplicate files, fdupes can do this for you. Pass it the directories to scan (dir1 and dir2 are placeholders here):

$ fdupes --recurse dir1 dir2

On Ubuntu you can install it with

$ sudo apt-get install fdupes
holygeek