With BLAST I have obtained a file with two tab-separated columns, one with species names and the other with a gene name (the name of the most similar gene in a reference database). My goal is to find in the first file all the species names for which the associated gene name is the most common one, and use this list of species to filter a second file (FASTA-formatted) keeping only those species.
For example, if this is the BLAST file:
sp1 XXX
sp2 AAA
sp3 XXX
sp4 XXX
sp5 BBB
The species that match the most common gene would be sp1, sp3 and sp4. Then in the FASTA file, which originally contains many species, I would like to keep only the sequences of species sp1, sp3 and sp4. So going from this:
>sp1
AATCGAGTCGT
>sp2
AATCGCGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
>sp5
AAACGAGTCGT
To this FASTA file:
>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
So far I have tried different approaches for a few days, but for one reason or another none of them works. I wrote this script (don.awk) to obtain the most common gene name (2nd column):
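BEGIN {
    FS = "\t"    # the BLAST table is tab-separated
}
# Count each gene name (2nd column) and remember the most frequent one so far
{ a[$2]++; if (a[$2] > comV) { comN = $2; comV = a[$2] } }
END {
    # Print the most common gene name
    printf("%s\n", comN)
}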
And then running it with
nawk -f don.awk blastfile
Then I tried assigning the name to a variable (for example var1) and using it in
awk -F'\t' '$2 ~ /$var1/ {print $0}' blastfile > result
to first filter the original file. However, either the variable is not saved or I get errors. I suspect there is an easier way.
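For reference, here is a minimal sketch of how the three steps described above could be chained. It is one possible approach, not a definitive solution. One likely reason the filter above fails is that the shell does not expand $var1 inside single quotes, so awk sees the literal regular expression /$var1/ rather than the gene name. Passing the value with awk's -v option avoids this. The file names input.fasta, keep.txt and filtered.fasta and the temporary working directory are assumptions made for illustration. The sketch also assumes that each FASTA header holds only the species name after the ">" sign.

```shell
#!/bin/sh
# Work in a scratch directory so no existing files are overwritten (assumption).
work=$(mktemp -d) && cd "$work" || exit 1

# Sample inputs copied from the example above (file names are hypothetical).
printf 'sp1\tXXX\nsp2\tAAA\nsp3\tXXX\nsp4\tXXX\nsp5\tBBB\n' > blastfile
printf '>sp1\nAATCGAGTCGT\n>sp2\nAATCGCGTCGT\n>sp3\nAATCGAGTAGT\n>sp4\nAATCGAGTCGG\n>sp5\nAAACGAGTCGT\n' > input.fasta

# Step 1: capture the most common gene name (2nd column) in a shell variable.
# On a tie, the gene that reaches the highest count first is kept.
var1=$(awk -F'\t' '{ if (++n[$2] > max) { max = n[$2]; gene = $2 } }
                   END { print gene }' blastfile)

# Step 2: hand the shell variable to awk with -v and compare it as a string,
# then keep the species names (1st column) of the matching rows.
awk -F'\t' -v gene="$var1" '$2 == gene { print $1 }' blastfile > keep.txt

# Step 3: read the species list first (NR == FNR), then print every FASTA
# record whose header name is in the list, including all its sequence lines.
awk 'NR == FNR { keep[$1]; next }
     /^>/      { show = (substr($1, 2) in keep) }
     show' keep.txt input.fasta > filtered.fasta

cat filtered.fasta
```

The string comparison $2 == gene is used instead of a regular-expression match so that a gene name such as AAA does not also select names that merely contain it.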