I'm looking for a way to improve this command, which looks for a column starting with a certain string ("product=") and prints that column (plus several of the following columns and the second and third columns, using ";" as the delimiter).

awk 'BEGIN{FS = ";"; OFS = "\t"}
  {for (i=1;i<=NF;i++){if ($i ~/^product=/)
  {print $2, $3, $i, $(i+1),$(i+2),$(i+3),$(i+4),$(i+5),$(i+6),$(i+7)}}}' file

for a file like this:

contig_19838    Prodigal:2.6    CDS 8893    10215   .   -   0   ID=PROKKA_33099;eC_number=3.5.99.8;gene=naaA;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:D3WZ85;locus_tag=PROKKA_33099;product=5-nitroanthranilic acid aminohydrolase
contig_19839    Prodigal:2.6    CDS 207 368 .   -   0   ID=PROKKA_33119;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_33119;product=hypothetical protein
contig_1984 Prodigal:2.6    CDS 101 853 .   -   0   ID=PROKKA_05585;inference=ab initio prediction:Prodigal:2.6,protein motif:CLUSTERS:PRK09421;locus_tag=PROKKA_05585;product=molybdate ABC transporter permease protein
contig_19840    Prodigal:2.6    CDS 50  352 .   +   0   ID=PROKKA_33120;eC_number=3.1.3.48;gene=cpsB;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q9AHD4;locus_tag=PROKKA_33120;product=Tyrosine-protein phosphatase CpsB

I would like to add the column that starts with "gene=" to the output. It can appear in different columns, and I'm not sure how to add an AND/OR condition.

I'm also having trouble printing the strings starting with "product", as the output is separated by whitespace and gets split into many columns. Hence I printed quite a few of the following columns (which of course looks weird), as I did not know how to combine this with the answers from "Using awk to print all columns from the nth to the last".

So I would like to have an output such as

gene=naaA   product=5-nitroanthranilic acid aminohydrolase
    product=hypothetical protein
    product=molybdate ABC transporter permease protein
gene=cpsB   product=Tyrosine-protein phosphatase CpsB

for lines with and without the "gene=" field. Any ideas?

crazysantaclaus

1 Answer

Assuming your actual Input_file has the same format as the sample shown, could you please try the following awk and let me know if it helps?

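awk '
{
  # find "gene=..." up to the next ";" (match sets RSTART and RLENGTH)
  match($0,/gene=[^;]*/);
  # stays empty when the line has no gene= field
  gene_value=substr($0,RSTART,RLENGTH);
  # find "product=..." through the end of the line
  match($0,/product=.*/);
  print gene_value,substr($0,RSTART,RLENGTH)
}
'   Input_file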
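
If tab-separated output is wanted, as in the desired output in the question, the same result can probably also be reached with ";" as the field separator and a loop that remembers the "gene=" and "product=" fields, which is one way to express the AND/OR the question asks about. This is only a sketch, assuming every attribute starts right after a ";" as in the sample; `extract_gene_product` is a hypothetical wrapper name, not something from the original command.

```shell
# Hypothetical helper: prints "gene=...<TAB>product=..." for each input line.
# Assumes ";"-separated attributes, as in the sample file from the question.
extract_gene_product() {
  awk -F';' -v OFS='\t' '
  {
    gene = ""; product = ""              # reset for every line
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^gene=/)         gene = $i
      else if ($i ~ /^product=/) product = $i
    }
    # gene stays empty when the line has no gene= field
    if (product != "") print gene, product
  }' "$@"
}

# Usage with one sample line on stdin; pass a file name instead to read a file.
printf '%s\n' 'contig_19839    Prodigal:2.6    CDS 207 368 .   -   0   ID=PROKKA_33119;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_33119;product=hypothetical protein' | extract_gene_product
```

Unlike the match() version, this one skips lines that have no "product=" field at all and does not depend on "product=" being the last attribute.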
RavinderSingh13
  • yes this worked perfectly! I'll check out how match() and RSTART and RLENGTH work. Thanks! – crazysantaclaus Mar 19 '18 at 17:36
  • @crazysantaclaus, glad that it helped you. `RSTART` and `RLENGTH` are built-in `awk` variables that `match` sets when the regex is found: `RSTART` is the position where the match starts and `RLENGTH` is its length. – RavinderSingh13 Mar 19 '18 at 17:39
  • @RavinderSingh13 you're printing `product_value`, but you haven't defined it as you did with `gene_value`, right? How does that work? – crazysantaclaus Mar 19 '18 at 17:50
  • @crazysantaclaus, sorry for the typo. At first I planned to store the product match in a variable too and print that, then decided there was no need :) I've edited it now, cheers :) – RavinderSingh13 Mar 19 '18 at 17:51
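
For reference, the behaviour of match() discussed in the comments can be seen on a single string. This is a minimal sketch using shortened versions of the sample attributes; `show_match` is a hypothetical helper name. When the regex is not found, match() sets RSTART to 0 and RLENGTH to -1, so substr() returns an empty string, which is why lines without "gene=" get an empty first column.

```shell
# Hypothetical helper: shows what match() stores in RSTART and RLENGTH.
show_match() {
  awk '{
    match($0, /gene=[^;]*/)
    printf "RSTART=%d RLENGTH=%d value=[%s]\n", RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
  }'
}

# Line with a gene= field
printf '%s\n' 'ID=PROKKA_33099;gene=naaA;product=x' | show_match
# prints: RSTART=17 RLENGTH=9 value=[gene=naaA]

# Line without a gene= field
printf '%s\n' 'ID=PROKKA_33119;product=hypothetical protein' | show_match
# prints: RSTART=0 RLENGTH=-1 value=[]
```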