I am comparing two hashsets of my data (one from a year ago, one current), and through a bit of bashing I have cut the two files down to just a hash value and a filename. We are talking close to 2 million entries.
From this great answer here I have been able to confirm where the hashes exist in both files, and where they exist in one but not the other (e.g. the second set has 40K files added to it, whereas only 4 files are missing from the first set, i.e. files that just don't appear in the second set).
I could verify that 40K files were added from old to new via:
awk 'FNR==NR{a[$1]=1;next}!($1 in a)' oldfile newfile | wc -l
and swapping the files around, I could see that only 4 files were missing.
I then realised I was basing this on hash alone. I'd actually like to base this on the filename.
Swapping the field number, I was able to confirm a slightly different set of numbers. The additions to the new file were no problem, but I noticed there were only 3 files missing from the first set.
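For reference, the filename-based counts are just the same one-liner keyed on the second field, run in both directions:
awk 'FNR==NR{a[$2]=1;next}!($2 in a)' oldfile newfile | wc -l   # added: in newfile, not in oldfile
awk 'FNR==NR{a[$2]=1;next}!($2 in a)' newfile oldfile | wc -l   # missing: in oldfile, not in newfile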
Now what I want to do is take this to the next level and confirm the number of files that exist in both locations (easy enough):
awk 'FNR==NR{a[$2]=1;next}($2 in a)' oldfile newfile | wc -l
but where the first field (the hash) differs.
:~/working-hashset$ head file?
==> file1 <==
111 abc
222 def
333 ghi
444 jkl
555 fff
666 sss
777 vvv
==> file2 <==
111 abc
212 def
333 ggi
454 jjl
555 fff
656 sss
777 vss
:~/working-hashset$ awk 'FNR==NR{a[$1]=1;b[$2];next}($2 in b) {if(($1 in a)) print $0;}' file1 file2
111 abc
555 fff
:~/working-hashset$ awk 'FNR==NR{a[$1]=1;b[$2];next}($2 in b) {if(!($1 in a)) print $0;}' file1 file2
212 def
656 sss
:~/working-hashset$
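Expanded with comments (my annotation of the same command), that last pair of one-liners works like this:
awk '
FNR==NR {          # while reading the first file (file1)...
    a[$1] = 1      # a: set of hashes seen in file1
    b[$2]          # b: set of filenames seen in file1 (referencing creates the key)
    next
}
($2 in b) {        # file2 line whose filename also exists in file1
    if ($1 in a)   # ...and whose hash also exists in file1
        print $0   # (use !($1 in a) for the mismatched-hash variant)
}' file1 file2
One caveat: the hash and the filename are looked up independently rather than as a pair, so in theory a line could pass both tests against two different file1 entries; on this data it doesn't matter.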
This has been a work in progress (just in the course of writing this question, which I started hours ago, I have solved some problems already... moving along).
I am at the stage where I have tested both files and have been able to detect hash collisions, good hashes, deleted files and new files.
:~/working-hashset$ head file?
==> file1 <==
111 dir1/aaa Original good
222 dir1/bbb Original changed
333 dir1/ccc Original good will move
444 dir1/ddd Original change and moved
555 dir2/eee Deleted
666 dir2/fff Hash Collision
999 dir2/zzz Deleted
==> file2 <==
111 dir1/aaa Good
2X2 dir1/bbb Changed
333 dir3/ccc Moved but good
4X4 dir3/ddd Moved and changed
111 dir4/aaa Duplicated
666 dir4/fzf Hash Collision
777 dir5/ggg New file
888 dir5/hhh New file
:~/working-hashset$ cat hashutil
#!/usr/bin/env bash
echo "Unique to file 1"
awk 'FNR==NR{a[$1]=1;b[$2];next}!($2 in b)' file2 file1 # in 1, !in 2
echo
echo "Unique to file 2"
awk 'FNR==NR{a[$1]=1;b[$2];next}!($2 in b)' file1 file2 # in 2, !in 1
echo
echo "In both files and good"
awk 'FNR==NR{a[$1]=1;b[$2];next}($2 in b) {if(($1 in a)) print $0;}' file2 file1 # in both files and good
echo
echo "In both files, wrong hash"
awk 'FNR==NR{a[$1]=1;b[$2];next}($2 in b) {if(!($1 in a)) print $0;}' file2 file1 # in both files and wrong hash
echo
echo "hash collision"
awk 'FNR==NR{a[$1]=1;b[$2];next}!($2 in b) {if(($1 in a)) print $0;}' file1 file2 # hash collision
echo
echo "Done!"
And this is the output:
Unique to file 1
333 dir1/ccc Original good will move
444 dir1/ddd Original change and moved
555 dir2/eee Deleted
666 dir2/fff Hash Collision
999 dir2/zzz Deleted
Unique to file 2
333 dir3/ccc Moved but good
4X4 dir3/ddd Moved and changed
111 dir4/aaa Duplicated
666 dir4/fzf Hash Collision
777 dir5/ggg New file
888 dir5/hhh New file
In both files and good
111 dir1/aaa Original good
In both files, wrong hash
222 dir1/bbb Original changed
hash collision
333 dir3/ccc Moved but good
111 dir4/aaa Duplicated
666 dir4/fzf Hash Collision
Done!
I now want to detect MOVED files.
I know that I'm going to need to break this into further "chunks", but those chunks will be delimited by the forward slash, at different levels of the path.
I know about the number of fields (NF). The idea is to compare the first field (space-delimited, i.e. the hash) together with the last slash-delimited component of the path (the basename); upon matching those, compare the rest of the path. If it's all the same then it's the same file, but if only the directory part differs, the file has moved.
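Here is a minimal sketch of that comparison, keying on hash plus basename (the path array name and the SUBSEP compound key are just my choices, nothing the data dictates):
awk '
FNR==NR {
    n = split($2, parts, "/")        # split the path on "/"
    path[$1 SUBSEP parts[n]] = $2    # key: hash + basename -> full path in file1
    next
}
{
    n = split($2, parts, "/")
    key = $1 SUBSEP parts[n]
    if (key in path && path[key] != $2)
        print path[key], "->", $2    # same hash and basename, different directory
}' file1 file2
On the sample data above this prints dir1/ccc -> dir3/ccc, but it also prints dir1/aaa -> dir4/aaa: a pairwise comparison alone can't tell a move from a copy, so distinguishing those would additionally require checking whether the old path still exists in file2.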
Beyond that sketch I don't really know where to go next (being 4am isn't helping).
Any help is appreciated.