I have an array of a few values that I loop over. In each iteration I want awk to check column 3 against the current array value and column 7 against a fixed string.

My script looks like this:

wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
gunzip Homo_sapiens.GRCh38.109.gtf.gz

declare -a arr=("gene" "exon" "transcript" "three_prime_utr" "five_prime_utr")
for i in "${arr[@]}"
do
   echo "$i"
   tail -n +6 Homo_sapiens.GRCh38.109.gtf | awk '{ if ($3==$i && $7="+") {print $0}}' > Homo_sapiens.GRCh38.109.$i.gtf
   head -n5 Homo_sapiens.GRCh38.109.gtf | cat - Homo_sapiens.GRCh38.109.$i.gtf > Homo_sapiens.GRCh38.109.$i.gtf. 
   mv Homo_sapiens.GRCh38.109.$i.gtf. Homo_sapiens.GRCh38.109.$i.gtf
 done

rm Homo_sapiens.GRCh38.109.gtf

The result is as follows:

$ wc -l *.gtf
         5 Homo_sapiens.GRCh38.109.exon.gtf
         5 Homo_sapiens.GRCh38.109.five_prime_utr.gtf
         5 Homo_sapiens.GRCh38.109.gene.gtf
   3420366 Homo_sapiens.GRCh38.109.gtf
         5 Homo_sapiens.GRCh38.109.three_prime_utr.gtf
         5 Homo_sapiens.GRCh38.109.transcript.gtf

Every output file contains only the 5 header lines, meaning I am unable to use $i properly inside awk.

If I run the commands for a single feature with the value hard-coded, e.g. exon:

   tail -n +6 Homo_sapiens.GRCh38.109.gtf | awk '{ if ($3=="exon" &&$7="+") {print $0}}' > Homo_sapiens.GRCh38.109.exon.gtf
   head -n5 Homo_sapiens.GRCh38.109.gtf | cat - Homo_sapiens.GRCh38.109.exon.gtf > Homo_sapiens.GRCh38.109.exon.gtf. 
   mv Homo_sapiens.GRCh38.109.exon.gtf. Homo_sapiens.GRCh38.109.exon.gtf

I get

   1648283 Homo_sapiens.GRCh38.109.exon.gtf
         5 Homo_sapiens.GRCh38.109.five_prime_utr.gtf
         5 Homo_sapiens.GRCh38.109.gene.gtf
   3420366 Homo_sapiens.GRCh38.109.gtf
         5 Homo_sapiens.GRCh38.109.three_prime_utr.gtf
         5 Homo_sapiens.GRCh38.109.transcript.gtf

The original Homo_sapiens.GRCh38.109.gtf file (image: original data; columns 3 and 7 are the columns of interest)

The Homo_sapiens.GRCh38.109.exon.gtf file (image: exon-filtered gtf file)

zerberus
  • Can you explain what you are trying to do in the **big picture**? – Gilles Quénot Mar 15 '23 at 11:06
  • There are Perl modules out there for parsing `gtf` files: https://metacpan.org/pod/Bio::FeatureIO::gtf and https://metacpan.org/pod/GenOO::GeneCollection::Factory::GTF – Gilles Quénot Mar 15 '23 at 11:08
  • Sadly these modules are not useful to me, as I am trying to filter for specific variables. – zerberus Mar 15 '23 at 11:22
  • Best I can tell, you should be doing all of that in 1 call to awk, not calling awk in a shell loop, but without a [mcve] with concise, testable **textual** (no images and no links) sample input and expected output in your question there isn't much we could do to help you. So we can help you, please [edit] your question to clearly state what you're trying to do and provide minimal, complete sample input and expected output that demonstrates your needs and that we could copy/paste to test a potential solution with. – Ed Morton Mar 15 '23 at 11:27
  • You may be interested in [how-do-i-use-shell-variables-in-an-awk-script](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script) but fixing THAT problem with `$i` in your awk script is not the right way to do whatever it is you're trying to do. – Ed Morton Mar 15 '23 at 11:29
  • The last comment did it. – zerberus Mar 15 '23 at 11:32
  • I was afraid of that. You're taking the band-aid instead of letting us help you set the broken leg. Oh well, you can always ask a new question if you'd like to make your script vastly more efficient and robust. At least copy/paste your script into http://shellcheck.net and fix the issues it tells you about though to patch some of the holes. – Ed Morton Mar 15 '23 at 11:40
  • I ran shellcheck. How could I improve the code further? – zerberus Mar 15 '23 at 11:53
  • See [my previous comment](https://stackoverflow.com/questions/75743745/using-array-variable-within-loop-for-two-variables-awk#comment133617638_75743745) for how you'd need to improve your question so we could help you. – Ed Morton Mar 15 '23 at 11:54
  • You should just accept your own answer to this one though (or not) and ask a new question if you want to do that as it'd require a big change to this question. – Ed Morton Mar 15 '23 at 11:56
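
A minimal sketch of why `$i` fails inside a single-quoted awk program, and of how passing the value with `-v` behaves instead. The three-column sample and the awk variable name `feature` are made up for illustration; they are not real GTF data.

```shell
#!/usr/bin/env bash
# Tiny made-up sample: three tab-separated fields (not real GTF data).
sample=$'chr1\tsrc\tgene\nchr1\tsrc\texon\n'
i="exon"

# Inside single quotes the shell never expands $i. awk sees its own
# uninitialized variable i, so $i means $0 (the whole line), and field 3
# never equals the whole line here.
broken=$(printf '%s' "$sample" | awk -F'\t' '$3 == $i {n++} END {print n+0}')

# Passing the shell value in with -v gives awk a real variable to compare.
fixed=$(printf '%s' "$sample" | awk -F'\t' -v feature="$i" '$3 == feature {n++} END {print n+0}')

echo "broken: $broken, fixed: $fixed"   # prints: broken: 0, fixed: 1
```

This matches the symptom in the question: no data row passes the `$3==$i` test, so each output file ends up with only the 5 header lines.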

1 Answer

Here is the corrected code:

declare -a arr=("gene" "exon" "transcript" "three_prime_utr" "five_prime_utr")
for i in "${arr[@]}"
do
   echo "$i"
   # Pass the shell variable into awk with -v; use == to compare ($7="+" would assign, not test)
   tail -n +6 Homo_sapiens.GRCh38.109.gtf | awk -v var="$i" '{ if ($3==var && $7=="+") {print $0}}' > Homo_sapiens.GRCh38.109."$i".gtf
   # Prepend the 5 header lines via a temporary file, then move it into place
   head -n5 Homo_sapiens.GRCh38.109.gtf | cat - Homo_sapiens.GRCh38.109."$i".gtf > Homo_sapiens.GRCh38.109."$i".gtf.
   mv Homo_sapiens.GRCh38.109."$i".gtf. Homo_sapiens.GRCh38.109."$i".gtf
   # Read the final file; the temporary "gtf." file no longer exists after mv
   convert2bed --max-mem=40G -i gtf < Homo_sapiens.GRCh38.109."$i".gtf > Homo_sapiens.GRCh38.109."$i".gtf.bed
 done

rm Homo_sapiens.GRCh38.109.gtf
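
For comparison, here is a rough sketch of the "1 call to awk" idea raised in the comments, not the code from this answer. The function name `split_plus_strand` is hypothetical, and it assumes, like the script above, that the header is exactly the first 5 lines and that the fields are tab-separated.

```shell
#!/usr/bin/env bash
# split_plus_strand is a hypothetical helper name, not part of the original script.
# Usage: split_plus_strand input.gtf
# Writes input.<feature>.gtf for each wanted feature in one pass over the input.
split_plus_strand() {
    local gtf=$1
    local prefix=${gtf%.gtf}
    awk -F'\t' -v prefix="$prefix" '
        BEGIN {
            # Map each wanted feature type to its output file name.
            n = split("gene exon transcript three_prime_utr five_prime_utr", names, " ")
            for (k = 1; k <= n; k++) wanted[names[k]] = prefix "." names[k] ".gtf"
        }
        # Assumption: the first 5 lines are the header; copy them to every output.
        NR <= 5 { for (f in wanted) print > (wanted[f]); next }
        # Keep plus-strand rows whose feature type is one of the wanted ones.
        ($3 in wanted) && $7 == "+" { print > (wanted[$3]) }
    ' "$gtf"
}
```

Calling `split_plus_strand Homo_sapiens.GRCh38.109.gtf` would read the large file once instead of twice per feature, and it needs no temporary "gtf." file.
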
zerberus
  • Mainly just be aware that if anything fails in your script you'll ignore that and continue processing thereby producing bad output and you'll destroy your input file. Look into how to test exit status (e.g. with `&&`) before using output and/or moving, removing, or overwriting any input files. – Ed Morton Mar 15 '23 at 11:59
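
One possible way to act on the exit-status advice, as a sketch only. The helper name `filter_feature` and the `.tmp` suffix are assumptions made for illustration; the awk test is the same one used in the answer, written with `==` for both comparisons.

```shell
#!/usr/bin/env bash
# filter_feature is a hypothetical helper; file names are passed in, not hard-coded.
# Usage: filter_feature input.gtf feature output.gtf
filter_feature() {
    local src=$1 feature=$2 out=$3
    local tmp="$out.tmp"   # assumed scratch name next to the output file
    # The && chain stops at the first failing command, so the final mv only
    # runs if the header copy and the awk filter both succeeded.
    head -n 5 "$src" > "$tmp" &&
    tail -n +6 "$src" | awk -F'\t' -v var="$feature" '$3 == var && $7 == "+"' >> "$tmp" &&
    mv "$tmp" "$out" || { rm -f "$tmp"; return 1; }
}
```

In the loop this could be used as `filter_feature Homo_sapiens.GRCh38.109.gtf "$i" Homo_sapiens.GRCh38.109."$i".gtf || exit 1`, with the final `rm` of the input reached only when every iteration succeeded.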