
I have a text file supplied.tsv with file paths in one column and file sizes in a second column, as follows. I want to ensure that the filenames are unique.

./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.cluster.stats  676
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.stats  788
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json  887
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json  887
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz  566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz  566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz.tbi  772

Expected output: Yes all unique filenames

My plan: I will first extract the first column from the file:

awk -F"\t" '{print $1}' supplied.tsv > supplied_firstcolumn.txt

Then I will extract the filename from each path and check for distinct lines. Kindly let me know how to do this efficiently.
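
For the second step I had something like this in mind (an untested sketch; it assumes path components are separated by / and prints any basename that occurs more than once):

# print the basename of each path, then report duplicated names
awk -F"\t" '{ n = split($1, parts, "/"); print parts[n] }' supplied.tsv | sort | uniq -d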

Khaned

2 Answers

awk '{ fil[$1]++ } END { for (i in fil) { if (fil[i]>1) { print i" - "fil[i];dup++ } } if (dup < 1) { print "No duplicates" } }' files.txt

Create an array called fil with the filename as the index and increment the value every time the file is seen. At the end, loop through the fil array and, if the value is greater than 1, print the filename and the count, and also increment a duplicates count (dup). If the dup variable is less than 1 at the end of the loop, print "No duplicates".
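
The question mentions unique filenames; if you want to compare only the basename rather than the full path, a variant of the same idea (a sketch, not tested against your real data) can key the array on the last path component instead of $1, here run on supplied.tsv with a tab field separator as in the question:

awk -F"\t" '{ n=split($1,p,"/"); fil[p[n]]++ } END { for (i in fil) { if (fil[i]>1) { print i" - "fil[i];dup++ } } if (dup < 1) { print "No duplicates" } }' supplied.tsv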

Raman Sailopal

"how to do this efficiently"

As you are interested in whether there are duplicates, not how many, I suggest stopping processing after hitting the 1st duplicate. I would do it the following way. Let file.txt content be:

./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.cluster.stats  676
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.stats  788
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json  887
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json  887
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz  566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz  566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz.tbi  772

then

awk 'BEGIN{uniq=1}(++arr[$1]>=2){uniq=0;exit}END{print uniq ? "all unique" : "found nonunique"}' file.txt

output

found nonunique

Explanation: First I set uniq to 1, which stays that way if no duplicates are found. Then, for every line, I increase the counter in arr for the given path ($1) and check whether, after that operation, it is greater than or equal to 2 - if it is, this means it is the 2nd or a later occurrence, so I set uniq to 0 and stop processing the file using exit - in other words, jump to END. In END I print depending on the uniq value; if you prefer to print only when no duplicates were found, you might use if(uniq){print "unique"} in END.

(tested in gawk 4.2.1)
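
For completeness, the variant mentioned above, which prints only when no duplicate was found, might look like this (same logic, only the END block changes):

awk 'BEGIN{uniq=1}(++arr[$1]>=2){uniq=0;exit}END{if(uniq){print "unique"}}' file.txt

And if you would rather test the result from a shell script, a small adaptation (a sketch, not tested) could report it through the exit status instead of a printed message:

# exit status 0 means all paths in column 1 are unique, 1 means a duplicate was found
awk 'BEGIN{found=0}(++arr[$1]>=2){found=1;exit}END{exit found}' file.txt && echo "all unique" || echo "found nonunique"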

Daweo