diff files comparing only first n characters of each line

Question

I have got 2 files. Let us call them md5s1.txt and md5s2.txt. Both contain the output of a

find -type f -print0 | xargs -0 md5sum | sort > md5s.txt

command in different directories. Many files were renamed, but the content stayed the same. Hence, they should have the same md5sum. I want to generate a diff like

diff md5s1.txt md5s2.txt

but it should compare only the first 32 characters of each line, i.e. only the md5sum, not the filename. Lines with equal md5sum should be considered equal. The output should be in normal diff format.

score 17 · Answer 1 · answered May 18 '11 at 15:43

Easy starter:

diff <(cut -d' ' -f1 md5s1.txt)  <(cut -d' ' -f1 md5s2.txt)

Also, consider just

diff -EwburqN folder1/ folder2/
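To see what the first command prints, here is a small sketch on made-up data. The checksums, the file names, and the function name `compare_md5_column` are placeholders, not anything from the question, and the `<(...)` process substitution assumes bash.

```shell
#!/usr/bin/env bash
# Sample data only: the checksums and file names below are made up.
tmp=$(mktemp -d)
printf '%032d  ./a.txt\n%032d  ./b.txt\n' 1 2 > "$tmp/md5s1.txt"
printf '%032d  ./renamed.txt\n%032d  ./c.txt\n' 1 3 > "$tmp/md5s2.txt"

# Compare only the first space-delimited field (the checksum) of two lists.
compare_md5_column() {
    diff <(cut -d' ' -f1 "$1") <(cut -d' ' -f1 "$2")
}

# diff exits with status 1 when the inputs differ, hence the "|| true".
compare_md5_column "$tmp/md5s1.txt" "$tmp/md5s2.txt" || true
```

The renamed file does not show up, because its checksum is unchanged. The lines that do differ are reported by checksum only, since `cut` has already dropped the file names.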

sehe


Extending this answer, if you really want *n* characters, something like: `diff <(cut -b-80 dump.csv) <(cut -b-80 dump2.csv)` (here, `n`=80) – Nick T Aug 30 '17 at 16:12
quick fwiw: extending the above (6 year old) comment, if you just want to check the md5, since it's 32 hex characters (a 128-bit hash), the actual `cut` would be (specified as characters) `diff <( cut -c-32 f1.txt | sort) <(cut -c-32 f2.txt | sort )`, which could also be written as `cut -b-32` or `cut -c1-32` etc (but using `cut -d' ' -f1` is convenient in that you don't have to count characters). Also, fwiw #2, all those `diff` options won't necessarily always be available (eg on macOS, no `-E`), but that `diff` doesn't solve the OP problem anyway. Last fwiw #3: I actually use `fdupes` for the OP orig problem. – michael Jun 21 '23 at 08:52

Suzanne Soy · Answer 2 · 2011-09-18T13:28:54.440

Compare only the md5 column by running diff on <(cut -c -32 md5sums.sort.XXX), and tell diff to print just the line numbers of added or removed lines with --new-line-format='%dn'$'\n' or --old-line-format='%dn'$'\n' (these are GNU diff options). Pipe those line numbers into ed md5sums.sort.XXX, which treats each number as a command and prints that line of the md5sums.sort.XXX file.

# Lines whose checksum appears only in the new list.
# ed -s suppresses the byte count that ed would otherwise print first.
diff \
    --new-line-format='%dn'$'\n' \
    --old-line-format='' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.new \
    > files-added
# Lines whose checksum appears only in the old list.
diff \
    --new-line-format='' \
    --old-line-format='%dn'$'\n' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.old \
    > files-removed

The problem with ed is that it will load the entire file into memory, which can be a problem if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following command, which will use much less memory.

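diff … | (
    lnum=0
    # Each line read from diff is the number of a line to print.
    while read -r lprint; do
        # Advance through the file on fd 3 until that line is reached.
        while [ "$lnum" -lt "$lprint" ]; do IFS= read -r line <&3; lnum=$((lnum + 1)); done
        printf '%s\n' "$line"
    done
) 3<md5sums.sort.XXX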
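If a set difference is enough, a different technique from the diff approach above is a single awk pass. This is only a sketch: the function name `lines_only_in_second` and the sample data are made up, and unlike diff it ignores line order and repeated checksums, so it answers "which checksums are new" rather than producing a real diff.

```shell
# Made-up sample data for illustration.
tmp=$(mktemp -d)
printf '%032d  ./a.txt\n%032d  ./b.txt\n' 1 2 > "$tmp/md5s1.txt"
printf '%032d  ./renamed.txt\n%032d  ./c.txt\n' 1 3 > "$tmp/md5s2.txt"

# Print the full lines of "$2" whose first 32 characters (the checksum)
# never occur in "$1". Only the checksums of "$1" are held in memory.
# Caveat: the NR == FNR idiom assumes "$1" is not empty.
lines_only_in_second() {
    awk 'NR == FNR { seen[substr($0, 1, 32)] = 1; next }
         !(substr($0, 1, 32) in seen)' "$1" "$2"
}

lines_only_in_second "$tmp/md5s1.txt" "$tmp/md5s2.txt"
```

Swapping the two arguments gives the lines whose checksum exists only in the first list.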

score 1 · Answer 3 · answered Sep 18 '11 at 14:05

If you are looking for duplicate files, fdupes can do this for you. Pass it the directories to scan:

$ fdupes --recurse <directory>...

On Ubuntu you can install it with

$ sudo apt-get install fdupes
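If you already have the checksum lists and would rather not install anything, a rough equivalent can be sketched with GNU uniq, which can compare only the first n characters of each line. The function name `same_checksum_lines` and the sample data are made up, and the `-w` and `-D` options assume GNU coreutils.

```shell
# Made-up sample data for illustration.
tmp=$(mktemp -d)
printf '%032d  ./a.txt\n%032d  ./b.txt\n' 1 2 > "$tmp/md5s1.txt"
printf '%032d  ./renamed.txt\n%032d  ./c.txt\n' 1 3 > "$tmp/md5s2.txt"

# Merge the lists, then keep every line whose first 32 characters (the
# checksum) occur on more than one line.
# -w32 compares only the first 32 characters; -D prints all duplicated lines.
same_checksum_lines() {
    sort "$@" | uniq -w32 -D
}

same_checksum_lines "$tmp/md5s1.txt" "$tmp/md5s2.txt"
```

Here this lists `a.txt` and `renamed.txt` together, since they share a checksum. Note that two identical files inside the same directory would show up as well.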

holygeek

