MD5 comparison between two text files

Question

I just started learning Linux shell scripting. I have to compare these two files in a shell script for version control. Example:

file1.txt

275caa62391ff4f3096b1e8a4975de40 apple
awd6s54g64h6se4h6se45wahae654j6 ball
e4rby1s6y4653a46h153a41bqwa54tvi cat
r53aghe4354hr35a4hr65a46eeh5j45ro castor

file2.txt

275caa62391ff4f3096b1e8a4975de40 apple
js65fg4a64zgr65f4w65ea465fa65gh7 ball
wroghah4a65ejdtse5z4g6sa7H658aw7 candle
wagjh54hr5ae454zrwrh354aha4564re castor

How can I sort the entries of these text files into newly added (present in file 2 but not in file 1), deleted (present in file 1 but not in file 2) and changed (same name but different checksum)? I tried diff, bcompare and vimdiff, but I could not get proper output as a text file.

Thanks in advance

For part 3 (same name but different checksum), try `md5sum -c file1 file2` — Avinash Yadav, Jan 20 '20 at 10:35

score 0 · Accepted Answer · answered Jan 20 '20 at 14:46

I don't know if such a command exists, but I've taken the liberty of writing a sorting mechanism for you in Bash. Although it's optimised, I suggest you recreate it in a language of your own choice.

#! /bin/bash

# Sets the array delimiter to a newline
IFS=$'\n'

# If $1 is empty, default to 'file1.txt'. Same for $2.
FILE1=${1:-file1.txt}
FILE2=${2:-file2.txt}

DELETED=()
ADDED=()
CHANGED=()

# Loop over array $1 and print content
function array_print {
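        # -n creates a nameref (a "pointer") to an array. This way
        # you can pass large arrays to functions by name (Bash 4.3+).
        local -n array=$1
        echo "$1: "

        # "${array[@]}" expands every element; "${array}" alone
        # would only yield the first one.
        for i in "${array[@]}"; do
                echo "$i"
        done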
}

# This function loops over the entries in file_in and looks up each
# name in file_tst. Unless an identical line is found there, the
# callback is executed.
function array_sort {
        local file_in="$1"
        local file_tst="$2"
        local callback=${3:-true}
        local -n arr0=$4
        local -n arr1=$5

        while read -r line; do

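                tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
                tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
                # Match the name column exactly, so that "cat" does
                # not also hit "catalog".
                hit=$(awk -v name="$tst_name" '$NF == name' "$file_tst")

                # If found, skip. Nothing is changed.
                [[ $hit != "$line" ]] || continue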

                # Run callback
                $callback "$hit" "$line" arr0 arr1

        done < "$file_in"
}

# If tst is empty, line will be added to not_found. For file 1 this 
# means that file doesn't exist in file2, thus is deleted. Otherwise
# the file is changed.
function callback_file1 {
        local tst=$1
        local line=$2
        local -n not_found=$3
        local -n found=$4

        if [[ -z $tst ]]; then
                not_found+=("$line")
        else
                found+=("$line")
        fi
}

# If tst is empty, line will be added to not_found. For file 2 this
# means that file doesn't exist in file1, thus is added. Since the 
# callback for file 1 already filled all the changed files, we do 
# nothing with the fourth parameter.
function callback_file2 {
        local tst=$1
        local line=$2
        local -n not_found=$3

        if [[ -z $tst ]]; then
                not_found+=("$line")
        fi
}

array_sort "$FILE1" "$FILE2" callback_file1 DELETED CHANGED 
array_sort "$FILE2" "$FILE1" callback_file2 ADDED CHANGED 

array_print ADDED
array_print DELETED
array_print CHANGED
exit 0

Since the code above might be hard to follow, I've also written it out as two plain loops, without the functions and callbacks. I hope it helps :-)

# Uses FILE1, FILE2 and the DELETED, ADDED and CHANGED arrays
# as defined in the script above.
while read -r line; do
       tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
       tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
       # Match the name column exactly, not as a substring.
       hit=$(awk -v name="$tst_name" '$NF == name' "$FILE2")

       # If found, skip. Nothing is changed.
       [[ $hit != "$line" ]] || continue

       # If name does not occur, it's deleted (exists in 
       # file1, but not in file2)
       if [[ -z $hit ]]; then
               DELETED+=("$line")
       else
       # If name occurs, it's changed. Otherwise it would
       # not come here due to the previous check.
               CHANGED+=("$line")
       fi
done < "$FILE1"

while read -r line; do
       tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
       tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
       # Match the name column exactly, not as a substring.
       hit=$(awk -v name="$tst_name" '$NF == name' "$FILE1")

       # If found, skip. Nothing is changed.
       [[ $hit != "$line" ]] || continue

       # If name does not occur, it's added. (exists in 
       # file2, but not in file1)
       if [[ -z $hit ]]; then
               ADDED+=("$line")
       fi
done < "$FILE2"
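
The same classification can also be done without starting a lookup process for every line, by loading file1.txt into a Bash associative array keyed on the name. This is only a minimal sketch and not the code from the answer above: it assumes Bash 4 or newer and one `checksum name` pair per line, and the `classify` function, the `old` and `seen` arrays and the temporary directory are names made up for illustration. The sample data is taken from the question.

```bash
#!/bin/bash
# Sketch: classify names as ADDED, DELETED or CHANGED with associative
# arrays. Assumes every line looks like "checksum name".

classify() {
    local file1=$1 file2=$2 hash name
    local -A old=() seen=()

    # Remember the checksum of every name listed in file1.
    while read -r hash name; do
        if [[ -n $name ]]; then old[$name]=$hash; fi
    done < "$file1"

    # Walk file2 and compare each entry with what file1 had.
    while read -r hash name; do
        [[ -n $name ]] || continue
        seen[$name]=1
        if [[ -z ${old[$name]+set} ]]; then
            echo "ADDED $name"
        elif [[ ${old[$name]} != "$hash" ]]; then
            echo "CHANGED $name"
        fi
    done < "$file2"

    # Names from file1 that never showed up in file2 were deleted.
    for name in "${!old[@]}"; do
        [[ -n ${seen[$name]+set} ]] || echo "DELETED $name"
    done
}

# Sample data from the question, written to a throwaway directory.
tmp=$(mktemp -d)
cat > "$tmp/file1.txt" <<'EOF'
275caa62391ff4f3096b1e8a4975de40 apple
awd6s54g64h6se4h6se45wahae654j6 ball
e4rby1s6y4653a46h153a41bqwa54tvi cat
r53aghe4354hr35a4hr65a46eeh5j45ro castor
EOF
cat > "$tmp/file2.txt" <<'EOF'
275caa62391ff4f3096b1e8a4975de40 apple
js65fg4a64zgr65f4w65ea465fa65gh7 ball
wroghah4a65ejdtse5z4g6sa7H658aw7 candle
wagjh54hr5ae454zrwrh354aha4564re castor
EOF

# Associative arrays have no defined order, so sort the report.
classify "$tmp/file1.txt" "$tmp/file2.txt" | sort > "$tmp/report.txt"
cat "$tmp/report.txt"
# ADDED candle
# CHANGED ball
# CHANGED castor
# DELETED cat
```

Each file is read exactly once, which should matter mostly for long lists; for a handful of lines the loops above behave the same.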

It looks like this Bash program only compares the data for four lines; I want to compare two text files that have many lines. Thank you so much for this code. — karkator, Jan 21 '20 at 07:45
@karkator Why do you think that? Data is taken from a file, regardless of the length of the file. — Bayou, Jan 21 '20 at 09:05

score 0 · Answer 2 · answered Jan 20 '20 at 15:45

Files which are only in file1.txt:

 awk 'NR==FNR{a[$2];next} !($2 in a)' file2.txt file1.txt > only_in_file1.txt

Files which are only in file2.txt:

 awk 'NR==FNR{a[$2];next} !($2 in a)' file1.txt file2.txt > only_in_file2.txt
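
In both commands, `NR==FNR` is true only while awk reads the first file named on the command line, so the first block collects the names from that file and the second pattern prints the lines of the other file whose name was not collected. Here is a small run against the sample data from the question. It is only a sketch: the `only_in` helper and the temporary directory are assumptions made for illustration, and names must not contain whitespace because awk splits fields on blanks.

```sh
# Work in a throwaway directory so nothing existing is overwritten.
cd "$(mktemp -d)" || exit 1

cat > file1.txt <<'EOF'
275caa62391ff4f3096b1e8a4975de40 apple
awd6s54g64h6se4h6se45wahae654j6 ball
e4rby1s6y4653a46h153a41bqwa54tvi cat
r53aghe4354hr35a4hr65a46eeh5j45ro castor
EOF
cat > file2.txt <<'EOF'
275caa62391ff4f3096b1e8a4975de40 apple
js65fg4a64zgr65f4w65ea465fa65gh7 ball
wroghah4a65ejdtse5z4g6sa7H658aw7 candle
wagjh54hr5ae454zrwrh354aha4564re castor
EOF

# only_in A B: print the lines of A whose name (field 2) never occurs in B.
only_in() {
    awk 'NR==FNR{a[$2];next} !($2 in a)' "$2" "$1"
}

only_in file1.txt file2.txt > only_in_file1.txt   # deleted entries
only_in file2.txt file1.txt > only_in_file2.txt   # added entries

cat only_in_file1.txt   # e4rby1s6y4653a46h153a41bqwa54tvi cat
cat only_in_file2.txt   # wroghah4a65ejdtse5z4g6sa7H658aw7 candle
```

Note that the `NR==FNR` idiom assumes the first file is not empty; with an empty first file the condition also holds for the second file and nothing is printed.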

Then something like this answer: awk compare columns from two files, impute values of another column

e.g:

awk 'FNR==NR{a[$2]=$1;next} ($2 in a){print $0,((a[$2]==$1)?"OK":"NA")}' file2.txt file1.txt | grep ' NA$' | awk '{print $1,$2}' > md5sdiffer.txt

You'll need to decide how you want to present these, though.

There might be a more elegant way to write the final example (as opposed to finding those with NA and then re-filtering), but it should be enough to get started.
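
As one possible way to avoid the NA marker and the second filtering step, all three categories can be produced by a single awk invocation that writes one text report. This is a sketch under assumptions, not a tested replacement: the ADDED, CHANGED and DELETED labels, the `old` and `seen` array names and the `report.txt` file name are made up for illustration, and names must not contain whitespace. The sample data is the one from the question.

```sh
# Work in a throwaway directory so nothing existing is overwritten.
cd "$(mktemp -d)" || exit 1

cat > file1.txt <<'EOF'
275caa62391ff4f3096b1e8a4975de40 apple
awd6s54g64h6se4h6se45wahae654j6 ball
e4rby1s6y4653a46h153a41bqwa54tvi cat
r53aghe4354hr35a4hr65a46eeh5j45ro castor
EOF
cat > file2.txt <<'EOF'
275caa62391ff4f3096b1e8a4975de40 apple
js65fg4a64zgr65f4w65ea465fa65gh7 ball
wroghah4a65ejdtse5z4g6sa7H658aw7 candle
wagjh54hr5ae454zrwrh354aha4564re castor
EOF

awk '
    # First file (file1.txt): remember the checksum stored for each name.
    NR == FNR { old[$2] = $1; next }

    # Second file (file2.txt): a name unknown to file1.txt was added; a
    # known name with a different checksum was changed. Appending ""
    # forces a string comparison even if a checksum looks like a number.
    {
        seen[$2] = 1
        if (!($2 in old))                   print "ADDED", $2
        else if ((old[$2] "") != ($1 ""))   print "CHANGED", $2
    }

    # Names from file1.txt that never appeared in file2.txt were deleted.
    END {
        for (name in old)
            if (!(name in seen)) print "DELETED", name
    }
' file1.txt file2.txt | sort > report.txt

cat report.txt
# ADDED candle
# CHANGED ball
# CHANGED castor
# DELETED cat
```

The `sort` at the end is there because awk does not guarantee any order when looping over an array with `for (name in old)`.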
