Filtering a fasta file with sequences that match a certain string in another file

Question

With BLAST I have obtained a file with two tab-separated columns, one with species names and the other with a gene name (the name of the most similar gene in a reference database). My goal is to find in the first file all the species names for which the associated gene name is the most common one, and use this list of species to filter a second file (FASTA-formatted) keeping only those species.

For example if this is BLAST file:

sp1 XXX
sp2 AAA
sp3 XXX
sp4 XXX
sp5 BBB

The species that match the most common gene would be sp1, sp3 and sp4. Then in the FASTA file, which originally contains many species, I would like to keep only the sequences of species sp1, sp3 and sp4. So going from this:

>sp1
AATCGAGTCGT
>sp2
AATCGCGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
>sp5
AAACGAGTCGT

To this fasta file:

>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG

So far I've tried for a few days different approaches but for one reason or another it never works. I've tried creating this script (don.awk) to obtain the most common gene name (2nd column):

BEGIN {
  FS="\t"
}
{a[$2]++; if (a[$2] > comV) { comN=$2; comV=a[$2]} }
END {
    printf("%s\n", comN, comV)
}

And then running it with

nawk -f don.awk blastfile

Then I tried assigning the name to a variable (for example var1) and using it in

awk -F'\t' '$2 ~ /$var1/ {print $0}' blastfile > result

to first filter the original file. But for some reason I can't save the variables or I get errors. But I guess there must be an easier way.

please update the question with samples from both files (including some non-matching data from in the fasta file), the (wrong) output generated by your code and the (correct) expected output — markp-fuso, Mar 31 '23 at 13:51
Assuming that `var1` is a shell variable, see [how-do-i-use-shell-variables-in-an-awk-script](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script) but you don't need shell variables for this - a single awk script can/should do everything you've described. — Ed Morton, Apr 01 '23 at 12:06

j_b · Answer 1 · 2023-04-01T04:24:14.937

Using GNU awk with grep:

grep -A 1 -f <(awk '
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}
{a[$2]++; b[$1]=$2}
END{for (i in a){j++; if (j==1){tgt=i; break}}
for (ii in b){if(tgt==b[ii]){print ">"ii}}}' blast.txt) input.fasta | grep -v '^--' > filtered.fasta

Output

>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG

Description

Set sort order for arrays to descending, numeric value of array values:

BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}

Populate array count of column 2 values using value as index. And populate second array using column 1 as key and column 2 as value.

{a[$2]++; b[$1]=$2}

Extract first array index as the target gene name for use in filtering species names from second array.

for (i in a){j++; if (j==1){tgt=i; break}}

Use extracted gene name above to get species names for use in filtering input.fasta file:

for (ii in b){if(tgt==b[ii]){print ">"ii}}

Use grep the -A 1 switch and the -f switch to extract the desired fasta records.

grep -A 1 -f <(awk...) input.fasta`

Pipe | to another grep to get rid of the -- between contiguous groups of matches and redirect the output to a file.

| grep -v '^--' > filtered.fasta

-or- Just using GNU awk:

awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}
BEGINFILE{fnum++}
fnum==1{a[$2]++; b[$1]=$2}
ENDFILE{if(fnum==1){
    for (i in a){j++; if (j==1){tgt=i; break}}} 
    for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
fnum==2{for (i in b){if($0==">"i){
    getline n; 
    printf "%s\n%s\n", $0, n}}
}' blast.txt input.fasta

Details

Set sort order for arrays to descending, numeric value of array values:

BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}

For each file processed, increment variable 'fnum'

BEGINFILE{fnum++}

For first file processed, populate array count of column 2 values using value as index. And populate second array using column 1 as key and column 2 as value.

fnum==1{a[$2]++; b[$1]=$2}

When finished processing first file, Extract first array index as the target gene name for use in filtering species names from second array. Then filter second array by deleting entries that do not match the target gene name.

ENDFILE{if(fnum==1){
    for (i in a){j++; if (j==1){tgt=i; break}}} 
    for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}

If processing the second file, then compare the current row to each entry in the filtered array and if there is a match print the current row and the following row and redirect the output to a file.

fnum==2{for (i in b){if($0==">"i){
    getline n; 
    printf "%s\n%s\n", $0, n}}
}' blast.txt input.fasta > filtered.fasta

Timur Shtatland · Answer 2 · 2023-03-31T16:52:10.037

Use standard UNIX tools plus a perl one-liner to extract the most frequent gene. Use grep and cut to extract the species from the blast file. Use grep to extract the FASTA headers, then extract those with the desired species, then to extract just the sequence ids. Finally, to extract reads from FASTA file by IDs, use seqtk subseq (install it with conda).

cut -f2 blast.txt | sort | uniq -c | sort -k1,1n | tail -n1 | perl -lane 'print $F[1];' > most_frequent_gene.txt

grep -f most_frequent_gene.txt blast.txt | cut -f1 > species_w_most_frequent_gene.txt

grep '^>' input.fasta | grep -f species_w_most_frequent_gene.txt | grep -Po '^>\K\S+' > seqids_w_species_w_most_frequent_gene.txt

seqtk subseq input.fasta seqids_w_species_w_most_frequent_gene.txt > seqs_w_species_w_most_frequent_gene.fasta

Inputs and outputs used in this example:

head blast.txt input.fasta most_frequent_gene.txt seqids_w_species_w_most_frequent_gene.txt seqs_w_species_w_most_frequent_gene.fasta species_w_most_frequent_gene.txt
==> blast.txt <==
sp1     XXX
sp2     AAA
sp3     XXX
sp4     XXX
sp5     BBB

==> input.fasta <==
>seq1 sp1
ACTG
>seq2 sp2
ACTGA
>seq3 sp3
ACTGAC
>seq4 sp3
ACTGAC

==> most_frequent_gene.txt <==
XXX

==> seqids_w_species_w_most_frequent_gene.txt <==
seq1
seq3
seq4

==> seqs_w_species_w_most_frequent_gene.fasta <==
>seq1 sp1
ACTG
>seq3 sp3
ACTGAC
>seq4 sp3
ACTGAC

==> species_w_most_frequent_gene.txt <==
sp1
sp3
sp4

See also:

Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
-f file : Obtain patterns from file, one per line.

\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.

To install seqtk and many other bioinformatics tools not listed here, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

Ed Morton · Answer 3 · 2023-04-01T12:05:13.667

Using any awk:

$ cat tst.awk
NR == FNR {
    gene2species[$2] = ($2 in gene2species ? gene2species[$2] FS : "") $1
    if ( ++cnt[$2] > maxCnt ) {
        maxCnt = cnt[$2]
        maxGene = $2
    }
    next
}
FNR == 1 {
    split(gene2species[maxGene],tmp)
    for ( i in tmp ) {
        maxSpecies[">"tmp[i]]
    }
}
/^>/ {
    isMaxSpecies = ($1 in maxSpecies ? 1 : 0)
}
isMaxSpecies

$ awk -f tst.awk blastfile fastafile
>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG

The above assumes there will only be 1 species with the max count, as in the example you provided. If there can be multiple species with the same max count of genes then tweak it to:

$ cat tst.awk
NR == FNR {
    gene2species[$2] = ($2 in gene2species ? gene2species[$2] FS : "") $1
    if ( ++cnt[$2] > maxCnt ) {
        maxCnt = cnt[$2]
    }
    next
}
FNR == 1 {
    for ( gene in cnt ) {
        if ( cnt[gene] == maxCnt ) {
            split(gene2species[gene],tmp)
            for ( i in tmp ) {
                maxSpecies[">"tmp[i]]
            }
        }
    }
}
/^>/ {
    isMaxSpecies = ($1 in maxSpecies ? 1 : 0)
}
isMaxSpecies

e.g.:

$ cat blastfile
sp1 XXX
sp2 AAA
sp3 XXX
sp4 AAA
sp5 BBB

$ awk -f tst.awk blastfile fastafile
>sp1
AATCGAGTCGT
>sp2
AATCGCGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG

score 0 · Answer 4 · answered Mar 31 '23 at 12:04

If you want to do it using other tools than awk, you can do the following:

#!/bin/bash

mostCommont=$(awk '{ print $2 }' ./input | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')
grep "$mostCommont" ./input | awk '{print $1}'

Or as a one-liner:

grep "$(awk '{ print $2 }' ./input | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')" ./input | awk '{print $1}'

Just replace ./input with the path to your input file.

Based on your sample data, this will print the following:

sp1
sp3
sp4

If you want to keep the gene name, just remove the final ... | awk '{print $1}

Oh thanks for your answer. It seems to work well! However, this would only be half of my question. Now I should filter my fasta file to keep only the sequences of the species that are in this list. In a fasta file, one line starts with > and the name of the species, and the next line is its sequence, and this pattern for as many species as there are. — MarcD, Mar 31 '23 at 13:20

Filtering a fasta file with sequences that match a certain string in another file

4 Answers4