
I have a bash script with two nested for loops that reads each line from a text file and then greps for that line in a different file. The text files (${AC}.ac.txt) are all lists like:

SF=0
SF=0,1
SF=0,2
SF=1
SF=1,2
SF=2

but with varying SF= options. I need grep to pull only lines with an exact match and not similar ones (e.g. SF=1 but not SF=1,2). I have tried many different grep options, such as grep "[$SF[:blank:]]", grep -P "${SF}\t", grep "$SF ", grep -P "${SF} GT" (the grep target is always followed by GT), and grep -P "${SF}\tGT", with no luck. I either get an empty file, or the other SF= options are not filtered out. I think the issue may be the way the commas are read when the bash variable is expanded. Can anyone help me with this? The loop is as follows:

for AC in {2..5}; do 
    for SF in $(cat ${AC}.ac.txt); do 
        grep  "${SF}" ${AC}_tmp1.vcf > ${AC}_tmp2.vcf
    done
done

And here are a couple of example lines from the target file:

NW_024423319.1  55690   .   A   C   407.13  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-2.153;ClippingRankSum=0;DP=27;ExcessHet=3.0103;FS=5.787;MQ=60;MQRankSum=0;QD=15.08;ReadPosRankSum=-0.519;SF=2    GT:GQ:PL:AD:DP  .:.:.:.:.   .:.:.:.:.   0/1:99:438,0,374:11,16:27
NW_024423319.1  55742   .   T   A   1396.9  PASS    AC=3;AF=0.5;AN=4;BaseQRankSum=0.716;ClippingRankSum=0;DP=57;ExcessHet=1.549;FS=0;MQ=49.3;MQRankSum=-0.537;QD=24.51;ReadPosRankSum=0.588;SF=1,2  GT:GQ:PL:AD:DP  .:.:.:.:.   0/1:99:272,0,731:20,9:29    1/1:84:1161,84,0:0,28:28
NW_024423319.1  65778   .   G   C   1445.14 PASS    AC=4;AF=1;AN=4;DP=35;ExcessHet=0.4576;FS=0;MQ=49.22;QD=30.73;SF=1,2 GT:DP:AD:PL:GQ  .:.:.:.:.   1/1:19:0,19:794,57,0:57 1/1:16:0,16:689,48,0:48

Thank you!!

Nevé B
  • Try to format your question so that the code and formulas are readable. As it is right now, only part of them are. – Ted Lyngmo Jul 27 '21 at 01:59
  • Try `for SF in $(grep '^SF=[0-9]$' ${AC}.ac.txt); do ...` – LMC Jul 27 '21 at 02:41
  • `grep -P "${SF}\t"` and `grep -P "${SF}\tGT"` look like they should work (I might recommend `grep "${SF}[[:blank:]]"`). Do your *.ac.txt files have anything weird/invisible in them, like DOS/Windows line endings? Try printing them with e.g. `LC_ALL=C cat -vet 1.ac.txt` -- that should add a `$` at the end of each line (indicating the end of line), but that's all. If you see anything else, it's a potential problem (e.g. `^M$` at the end of lines, that's DOS/Windows format). – Gordon Davisson Jul 27 '21 at 10:16
  • This seems like something which could be achieved more easily using `bcftools`. – user438383 Jul 27 '21 at 17:01
  • @Ted Lyngmo fixed it, thank you – Nevé B Jul 27 '21 at 17:20
  • @Gordon Davisson This was it! Fixed the format and it worked. I appreciate your help, thanks so much!! – Nevé B Jul 27 '21 at 17:21
  • @user438383 yes probably, but it's part of a longer script that someone else wrote, this was just a place where it was sticking for me – Nevé B Jul 27 '21 at 17:29
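
For anyone hitting the same symptom, the fix confirmed in the comments (DOS/Windows line endings in the list file) can be sketched roughly as below. This is a minimal sketch, not the original script: the demo files, their contents, and the helper name `match_exact_sf` are made up for illustration, and it assumes the VCF is tab-separated so that the SF value is followed directly by a tab.

```shell
# Hypothetical demo data: a pattern list saved with DOS/Windows (CRLF) line
# endings, and a tiny tab-separated stand-in for the real VCF.
workdir=$(mktemp -d)
printf 'SF=1\r\nSF=1,2\r\n' > "$workdir/demo.ac.txt"
printf 'chr\t1\tAC=1;SF=1\tGT\nchr\t2\tAC=3;SF=1,2\tGT\nchr\t3\tAC=2;SF=1,2,3\tGT\n' > "$workdir/demo.vcf"

# A literal tab character, built portably.
tab=$(printf '\t')

# match_exact_sf (hypothetical helper): print the lines of a VCF ($2) whose
# SF value is exactly $1, i.e. the value is followed directly by a tab.
match_exact_sf() {
    # -F treats the pattern as a fixed string, so nothing in it is special.
    # "|| true" keeps "no match" from being reported as a failure.
    grep -F -- "$1$tab" "$2" || true
}

# Strip carriage returns while reading the list, then filter per SF value.
tr -d '\r' < "$workdir/demo.ac.txt" | while IFS= read -r sf; do
    [ -n "$sf" ] || continue        # skip empty lines
    echo "== $sf =="
    match_exact_sf "$sf" "$workdir/demo.vcf"
done
```

If the carriage returns are left in place, the pattern becomes the SF value followed by a carriage return and then a tab, a sequence that never occurs in the VCF, which is consistent with the empty output file described in the question.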
