Bash : remove lines from fasta file

Question

I wonder what is the best way to remove some lines from a fasta file in bash.

In the example above, let's say I want to remove the line where it's written 'GUITH', how do you remove this line and above lines, until you find some other '>' character ?

fasta file:

>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>O46898_GUITH
RSLSKGPYIAAHLLKKLNNVDIQKPDVVIKTWSRSSTILPNMVGATIAVYNGKQHVPVYI
SDQMVGHKLGEFSPTRTFRSH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

fasta file after filtering with bash:

>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

There is an other version of the question, but harder manipulation. Let's say you have a file with species names :

species.txt :

DROMO;
PLAYO;

And you want to delete lines in the fasta file where species are not present in the species.txt document. So you get the same output as above, but you get the lines to erase thanks to some other file (not entering 'GUITH' directly). What would be the best way of doing that ?

I tried with Python, which works but it's not optimized. And with bash, iI tried something with `sed`, but couldn't find any that would work. — Natha, Jan 12 '17 at 12:58
See also https://stackoverflow.com/questions/55636069/remove-multiple-sequences-from-fasta-file — tripleee, Apr 14 '19 at 15:15

Cyrus · Answer 1 · 2017-01-12T11:57:54.697

To remove the line where it's written 'GUITH':

sed 's/>/\n&/' fasta.txt | sed '/_GUITH/,/^$/d' | sed '/^$/d'

To delete lines in the fasta file where species are not present in the species.txt:

With GNU sed and bash:

sed 's/>/\n&/' fasta.txt | sed -n -f <( sed 's/;$//;s|.*|/_&$/,/^$/p|' species.txt ) | sed '/^$/d'

Output:

>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

James Brown · Accepted Answer · 2017-01-12T14:07:34.410

In awk:

$ awk '/^>/{p=1} /GUITH/{p=0} p' file
>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

Explained:

/^>/ { p=1 }    # turn print flag up for each record starting with >
/GUITH/ { p=0 } # turn print flag down for GUITH
p               # print if p

If you want to have a list of approved names:

$ cat list
DROMO
PLAYO
$ awk 'NR==FNR{a[$1];next} /^>/{n=split($0,b,"_"); p=(b[n] in a)} p' list file
>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

Explained:

NR==FNR { a[$1]; next }                   # read the list to array a
/^>/ { n=split($0,b,"_"); p=(b[n] in a) } # take the word after _ and if in a, enable print
p                                         # if p, print

Thanks a lot, the first solution works perfectly well, and the nice explanations! However, I get this error when trying from the list of approved names : `awk: illegal field $(), name "i"` . Do you know where it could come from ? — Natha, Jan 12 '17 at 14:04
Yeah, there was a typo, it should be: `NR==FNR { a[$1]; next }`, not `$i` like it was originally. What awk are you using? On Mac? — James Brown, Jan 12 '17 at 14:08

Bash : remove lines from fasta file

2 Answers2