0

I am attempting to write a script that takes three user inputs:

  1. The name of a text file with rows corresponding to a chromosome and basepair position, for example extract.txt.
  2. The name of a VCF file to search through for the values in the text file and extract, i.e. test.vcf.
  3. The desired output name of a new VCF file that contains the extracted rows from test.vcf

For those unfamiliar with VCF files, there is a lot of complex header info but the only header line that matters in this problem context is the final header line that begins with #CHROM. A toy example of one of these files looks like this (Note: in a VCF file these columns are tab separated but I am not sure how to format that here):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
Chr06 5567134 . T C 999 PASS DP=13782;VDB=0.0302;AF1=0.7154;AC1=1312;DP4=1987,2142,4337,4552;MQ=39;FQ=999;PV4=0.49,1.5e-10,0,1 GT:PL:DP:SP:GQ
Chr06 5567140 . G A 999 PASS DP=13537;VDB=0.0304;AF1=0.7489;AC1=1374;DP4=1744,1858,4383,4621;MQ=39;FQ=999;PV4=0.8,0.068,0,1 GT:PL:DP:SP:GQ
Chr06 5567195 . G T 999 PASS DP=12016;VDB=0.0284;AF1=0.384;AC1=704;DP4=3311,4518,1537,1850;MQ=37;FQ=999;PV4=0.0026,0,0,1 GT:PL:DP:SP:GQ
Chr06 5567224 . TAAAAA TAGAAACAAAAA 999 PASS INDEL;DP=13190;VDB=0.0352;AF1=0.1229;G3=0.7854,0.1806,0.03405;HWE=1.71e-05;AC1=225;DP4=4930,5006,542,685;MQ=40;FQ=999;PV4=0.00035,1,0,1 GT:PL:DP:SP:GQ
Chr06 5567247 . T A 999 PASS DP=14383;VDB=0.03;AF1=0.1233;G3=0.7772,0.1986,0.02415;HWE=0.022;AC1=226;DP4=6484,6134,587,654;MQ=41;FQ=999;PV4=0.0062,9.3e-07,0,0.16 GT:PL:DP:SP:GQ
Chr06 5567444 . TAAAAAA TAAAAAAA 999 PASS INDEL;DP=10303;VDB=0.0235;AF1=0.6037;G3=0.1784,0.4375,0.3841;HWE=0.0162;AC1=1107;DP4=1996,2224,2158,2446;MQ=46;FQ=999;PV4=0.7,1,6.3e-30,1 GT:PL:DP:SP:GQ
Chr06 5567497 . AG A 999 PASS INDEL;DP=5243;VDB=0.028;AF1=0.07142;G3=0.873,0.08398,0.04299;HWE=0.000541;AC1=131;DP4=2297,2010,0,146;MQ=46;FQ=999;PV4=0,0,0.00097,0 GT:PL:DP:SP:GQ
Chr06 5567499 . TAAA TGGCCAAATAAGCCACTCAAAAGAAATACAGCCAAAAACATCTACAAA 999 PASS INDEL;DP=5243;VDB=0.0273;AF1=0.3508;G3=0.4662,0.2686,0.2652;HWE=1.95e-12;AC1=643;DP4=1993,1638,195,545;MQ=46;FQ=999;PV4=0,1,5.1e-18,0 GT:PL:DP:SP:GQ
Chr06 5567583 . G C 999 PASS DP=12372;VDB=0.0279;AF1=0.09276;AC1=170;DP4=4794,5928,512,569;MQ=42;FQ=999;PV4=0.095,1,5.5e-31,1 GT:PL:DP:SP:GQ
Chr06 5567628 . G T 999 PASS DP=12657;VDB=0.0244;AF1=0.1049;AC1=192;DP4=5230,6194,197,578;MQ=40;FQ=999;PV4=1.2e-29,9.8e-06,0,1 GT:PL:DP:SP:GQ

A toy example of the txt file of items to extract looks like this:

Chr06\t5567140
Chr06\t5567583
Chr06\t5567224

I have created this txt file to explicitly contain the VCF delimiter between the Chromosome and position values although I can replace this with an actual tab value instead of '\t' if that makes it easier.

I have a bash script that I think is 90% of the way there. It seems there is an issue with my grep command within the looping through of the text file that I can't resolve.

Here is my bash script:

#!/bin/bash

# Reset getopts
OPTIND=1

# Read in variables
while getopts ":i:v:o:" opt; do
  case "$opt" in
    i)
      input=$OPTARG
      ;;
    v)
      vcf=$OPTARG
      ;;
    o)
      output=$OPTARG
      ;;
    esac
done

echo
echo "File containing SNP positions: '$input'"
echo "VCF file to extract from:      '$vcf'"
echo "Output filename:               '$output'"


LC_ALL=C grep '#CHROM' $vcf > $output

while IFS= read -r line
do

echo "$line"
LC_ALL=C grep '$line' $vcf >> $output

done < $input

The reading of arguments from the command line seems to work as expected but when I run the script, the output file that is created only has the #CHROM header line in it and none of the subsequent data.

I have also tried using grep -f with the name of the text file instead of looping but it had the same result.

Any help is greatly appreciated, or commentary on how to improve my existing code. Let me know if anything about the question is unclear.

Thanks!

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
C. John
  • 144
  • 1
  • 15

0 Answers0