I have a file with millions of lines like the following:

chr1    18217866        .       T       A       52.2409 .       AB=0;ABP=0;AC=2;AF=0;AN=2;AO=2;CIGAR=1X;DP=2;DPB=2;DPRA=0;EPP=7.35324;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=7.37776;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=0;RO=0;RPP=7.35324;RPPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp      GT:DP:RO:QR:AO:QA:GL    1/1:2:0:0:2:74:-7.03,-0.60206,0

I am trying to count all of the lines whose second column matches a given number and where AF=0, like so:

grep '1821786*' file.vcf | cut -f 8 | awk -F \; '$4 == 0 {print $4}' | wc -l

The problem is that the first two stages:

grep '1821786*' file.vcf | cut -f 8

produce a fourth semicolon-separated field of AF=0, not 0, so the comparison $4 == 0 in the awk statement never matches.

Is there a way to strip off the AF= so that the awk statement will match 0 in the 4th column?

The Nightman

2 Answers

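It can all be done in a single awk command, and more accurately, because the position is tested against the start of the second column rather than anywhere on the line: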

awk -F '[;[:blank:]]+' '$2 ~ /^1821786/ && $11 == "AF=0"{++n} END{print n}' file.vcf

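-F '[;[:blank:]]+' sets the input field separator to one or more semicolons, spaces, or tabs, so each INFO entry becomes an ordinary field and AF=0 is field 11 on the sample line.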
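
If the position of AF inside the INFO column is not guaranteed, a variant that splits on tabs and then searches column 8 for the exact entry may be safer. This is only a sketch: the three sample lines are made up and trimmed, and the variable names (tmp, prefix, count_af0) are my own. It also prints n + 0, so a count of zero shows up as 0 rather than an empty line.

```shell
# Build a tiny tab-separated sample (hypothetical data, trimmed INFO column).
tmp=$(mktemp)
printf 'chr1\t18217866\t.\tT\tA\t52.2\t.\tAB=0;ABP=0;AC=2;AF=0;AN=2\n' >  "$tmp"
printf 'chr1\t18217867\t.\tG\tC\t40.1\t.\tAB=0;ABP=0;AC=1;AF=0.5;AN=2\n' >> "$tmp"
printf 'chr2\t91821786\t.\tC\tT\t33.0\t.\tAB=0;ABP=0;AC=2;AF=0;AN=2\n' >> "$tmp"

# Split on tabs only, then split column 8 (INFO) on ';' and look for the exact entry.
count_af0=$(awk -F '\t' -v prefix='1821786' '
    index($2, prefix) == 1 {            # column 2 starts with the prefix
        n_items = split($8, info, ";")
        for (i = 1; i <= n_items; i++)
            if (info[i] == "AF=0") { n++; break }
    }
    END { print n + 0 }                 # n + 0 prints 0 instead of an empty line
' "$tmp")
echo "$count_af0"
rm -f "$tmp"
```

Only the first sample line should be counted: the second has AF=0.5, and in the third the digits 1821786 do not start the second column.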

anubhava

awk has a substitution function, sub(), that is useful here. Called without a target it edits the whole record, which awk then re-splits, so $4 becomes 0:

grep '1821786*' file.vcf | cut -f 8 | awk -F \; '{sub(/AF=/,"")} $4 == 0 {print $4}' | wc -l

The same approach can be used for any of the other keys in the INFO column of a VCF file as needed.
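
One caveat when reusing sub() for other keys: an unanchored pattern such as /RO=/ also matches inside PRO=, and sub() replaces whichever occurrence comes first. A rough sketch that compares the whole key instead is below. count_info is a hypothetical helper name of my own, the sample data is made up, and it assumes INFO is the eighth tab-separated column, as in the line from the question.

```shell
# count_info KEY VALUE FILE: count lines whose INFO column (column 8) has KEY=VALUE.
# count_info is a hypothetical helper name, not part of any VCF tool.
count_info() {
    awk -F '\t' -v key="$1" -v val="$2" '
        {
            n_items = split($8, info, ";")
            for (i = 1; i <= n_items; i++) {
                eq = index(info[i], "=")
                # Compare the whole key, so RO does not match PRO.
                if (eq > 0 && substr(info[i], 1, eq - 1) == key &&
                    substr(info[i], eq + 1) == val) { n++; break }
            }
        }
        END { print n + 0 }
    ' "$3"
}

# Two made-up lines with a shortened INFO column.
sample=$(mktemp)
printf 'chr1\t18217866\t.\tT\tA\t52.2\t.\tAF=0;PRO=0;RO=3\n' >  "$sample"
printf 'chr1\t18217900\t.\tG\tC\t40.1\t.\tAF=0.5;PRO=3;RO=0\n' >> "$sample"

ro_three=$(count_info RO 3 "$sample")
af_zero=$(count_info AF 0 "$sample")
echo "RO=3: $ro_three, AF=0: $af_zero"
rm -f "$sample"
```

The values are compared as strings, so AF=0 does not match AF=0.5; a numeric comparison would need val + 0 on both sides.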

The Nightman

Yes, see the answer to your question from last week: http://stackoverflow.com/q/34798060/3776858 – Cyrus Jan 18 '16 at 21:27