
I have a text file supplied.tsv with file paths in one column and file sizes in a second column, as follows. I want to ensure that the filenames are unique.

./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.cluster.stats  676
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.stats  788
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json  887
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json  887
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz  566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz  566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz.tbi  772

Expected output: Yes all unique filenames

My plan: I will first extract the first column from the file:

awk -F"\t" '{print $1}' supplied.tsv > supplied_firstcolumn.txt

Then I will extract the filename from each path and check for distinct lines. Kindly let me know how to do this efficiently.
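
For the second step I had something like this in mind (an untested sketch; it assumes path components are separated by / and prints any basename that occurs more than once):

# print the basename of each path, then report duplicated names
awk -F"\t" '{ n = split($1, parts, "/"); print parts[n] }' supplied.tsv | sort | uniq -d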

Khaned

2 Answers

awk '{ fil[$1]++ } END { for (i in fil) { if (fil[i]>1) { print i" - "fil[i];dup++ } } if (dup < 1) { print "No duplicates" } }' files.txt

Create an array called fil with the filename as the index and increment the value every time the file is seen. At the end, loop through the fil array and, if the value is greater than 1, print the filename and the count, and also increment a duplicates count (dup). If the dup variable is less than 1 at the end of the loop, print "No duplicates".
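
The question mentions unique filenames; if you want to compare only the basename rather than the full path, a variant of the same idea (a sketch, not tested against your real data) can key the array on the last path component instead of $1, here run on supplied.tsv with a tab field separator as in the question:

awk -F"\t" '{ n=split($1,p,"/"); fil[p[n]]++ } END { for (i in fil) { if (fil[i]>1) { print i" - "fil[i];dup++ } } if (dup < 1) { print "No duplicates" } }' supplied.tsv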

Raman Sailopal

"how to do this efficiently"

As you are interested in whether there are duplicates, not how many, I suggest stopping processing after hitting the 1st duplicate. I would do it the following way. Let file.txt content be:

./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.cluster.stats  676
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.stats  788
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json  887
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json  887
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz  566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz  566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz.tbi  772

then

awk 'BEGIN{uniq=1}(++arr[$1]>=2){uniq=0;exit}END{print uniq ? "all unique" : "found nonunique"}' file.txt

output

found nonunique

Explanation: First I set uniq to 1, which stays that way if no duplicates are found. Then, for every line, I increase the counter in arr for the given path ($1) and check whether, after that operation, it is greater than or equal to 2 - if it is, this means it is the 2nd or a later occurrence, so I set uniq to 0 and stop processing the file using exit - in other words, jump to END. In END I print depending on the uniq value; if you prefer to print only when no duplicates were found, you might use if(uniq){print "unique"} in END.

(tested in gawk 4.2.1)
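
For completeness, the variant mentioned above, which prints only when no duplicate was found, might look like this (same logic, only the END block changes):

awk 'BEGIN{uniq=1}(++arr[$1]>=2){uniq=0;exit}END{if(uniq){print "unique"}}' file.txt

And if you would rather test the result from a shell script, a small adaptation (a sketch, not tested) could report it through the exit status instead of a printed message:

# exit status 0 means all paths in column 1 are unique, 1 means a duplicate was found
awk 'BEGIN{found=0}(++arr[$1]>=2){found=1;exit}END{exit found}' file.txt && echo "all unique" || echo "found nonunique"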

Daweo