Remove multiple sequences from fasta file

Question

I have a text file of character sequences, each consisting of two lines: a header and, on the following line, the sequence itself. The structure of the file is as follows:

>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

In another file I have a list of the headers of the sequences that I would like to remove, like this:

>header1
>header5
>header12
[...]
>header145

The idea is to remove these sequences from the first file, that is, each of these headers plus the line that follows it. I did it with sed, like this:

while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt

It works, but it takes quite long because sed rereads and rewrites the whole file once per header, and the file is quite big. Any idea how I could speed this up?

Why not first transform your second file into a sed script with the delete commands, then apply that in one run against the data file? — daniu, Apr 11 '19 at 15:30
How big is the second file? More like a thousand lines, or a million lines? — Jerome, Apr 11 '19 at 15:34

score 5 · Answer 1 · answered Apr 11 '19 at 16:29

The question you have is easy to answer but will not help you when you handle generic fasta files. Fasta files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The Fasta file-format roughly obeys the following rules:

The description line (defline) or header/identifier line, which begins with the greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.

Following the description line is the actual sequence itself as a standard one-letter character string. Anything other than a valid character is ignored (including spaces, tabs, asterisks, etc.).

The sequence can span multiple lines.

A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.

Most of the presented methods will fail on a multi-fasta file with multi-line sequences.

The following handles multi-line sequences as well:

awk '(NR==FNR) { toRemove[$1]; next }
     /^>/ { p=1; for (h in toRemove) if (index($0, h)) p=0 }
     p' headers.txt file.fasta

This is very similar to the answers of Ed Morton and anubhava, but the difference is that each entry in headers.txt may contain only a part of the header.
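To illustrate the multi-line point, here is a small sketch on made-up data. It is a variant of the command above rather than the same command: it assumes each entry in headers.txt is the complete first whitespace-separated field of a header (the sequence ID), so it uses an exact lookup instead of a substring test. The file names and sample records are hypothetical.

```shell
# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/file.fasta" <<'EOF'
>header1 first sample
aaaa
aaaa
>header2 second sample
bbbb
bbbb
bbbb
>header3 third sample
cccc
EOF
printf '%s\n' '>header1' '>header3' > "$tmp/headers.txt"

# p is recomputed on every header line and then applies to all sequence
# lines that follow, however many there are.
kept=$(awk 'NR==FNR { toRemove[$1]; next }
     /^>/ { p = !($1 in toRemove) }
     p' "$tmp/headers.txt" "$tmp/file.fasta")
printf '%s\n' "$kept"
rm -rf "$tmp"
```

The exact lookup avoids looping over every entry for each header line, at the cost of no longer accepting partial headers.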

My fasta files always have each sequence on a single line, so all the given answers should work (this is a step in a custom pipeline, so I'm sure the input files are correctly formatted). Anyway, thanks for your answer — Loïs Rancilhac, Apr 11 '19 at 16:43

Ed Morton · Answer 2 · 2019-04-11T15:54:40.707

$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.

Alternatively:

$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.

The first script relies on you knowing how many lines long each record is, while the second relies on every record starting with >. If both hold, which one you use is a style choice.
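A quick way to see the difference is to run both scripts on a record that spans more than two lines. The sketch below uses made-up data in a temporary directory; the file names list and file follow the commands above.

```shell
tmp=$(mktemp -d)
# >header1 has a two-line sequence, so the record is three lines long.
printf '%s\n' '>header1' 'aaaa' 'aaaa' '>header2' 'bbbb' > "$tmp/file"
printf '%s\n' '>header1' > "$tmp/list"

# Counter version: skips exactly 2 lines, so the second "aaaa" line is kept.
by_count=$(awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' "$tmp/list" "$tmp/file")

# Flag version: skips everything up to the next line starting with ">".
by_flag=$(awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a)} !f' "$tmp/list" "$tmp/file")

printf 'counter:\n%s\nflag:\n%s\n' "$by_count" "$by_flag"
rm -rf "$tmp"
```

With strictly two-line records both variants give the same result.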

The alternative solution is the best solution for fasta files (the OP neglects to mention this), the first solution requires that all sequences have the same amount of lines and that is not always the case. — kvantour, Apr 11 '19 at 16:26

score 1 · Answer 3 · answered Apr 11 '19 at 15:54

You may use this awk:

awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt

anubhava

HardcoreHenry · Answer 4 · 2019-04-11T15:47:11.940

One option is to create a long sed expression:

sedcmd=
while IFS= read -r line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt

This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...)

Using a file (as @daniu suggests) might be better if you have thousands of headers, as you risk hitting the maximum command-line length with this method.
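To get a feel for when that limit becomes a concern, a rough estimate of the expression size can be compared with the system's ARG_MAX. This is only a sketch: the 1000 generated headers are made up, and some systems (Linux among them) cap a single argument well below ARG_MAX, so treat ARG_MAX as an upper bound.

```shell
tmp=$(mktemp -d)
# Hypothetical list of 1000 headers standing in for second_file.txt.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print ">header" i }' > "$tmp/second_file.txt"

# Each header line becomes "/^HEADER$/,+1d;", i.e. the header plus 9 characters.
# wc -c already counts one newline per line, so add 8 more per line.
bytes=$(wc -c < "$tmp/second_file.txt")
lines=$(wc -l < "$tmp/second_file.txt")
script_len=$((bytes + 8 * lines))

# ARG_MAX limits the combined size of the arguments and the environment.
arg_max=$(getconf ARG_MAX)
echo "sed expression: $script_len bytes, ARG_MAX: $arg_max bytes"
rm -rf "$tmp"
```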

score 0 · Accepted Answer · answered Apr 11 '19 at 15:41

Create a script with the delete commands from the second file:

sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed

Then apply that script to the first file:

sed -f commands.sed firstFile.txt

daniu

In addition to being cumbersome to write and needing a temp file, that will fail in various ways given various values in `secondFile.txt` and various contents of `firstFile.txt`. To make it robust see https://stackoverflow.com/q/29613304/1745001. You could reduce the number of most likely issues by adding anchors when creating the script. You also don't need a capture group since `s#\(.*\)#/\1/` = `s#.*#/&/`. – Ed Morton Apr 11 '19 at 16:04
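The two suggestions in that comment (anchors, and `&` instead of a capture group) could be combined roughly as follows. This is a sketch on made-up data, assuming GNU sed, since the `addr,+1` address form is a GNU extension. It still breaks on headers that contain regex metacharacters or a `/`, which is what the linked question addresses.

```shell
tmp=$(mktemp -d)
# Hypothetical sample files; note that >header12 starts with >header1.
printf '%s\n' '>header1' 'aaaa' '>header12' 'bbbb' '>header2' 'cccc' > "$tmp/firstFile.txt"
printf '%s\n' '>header1' > "$tmp/secondFile.txt"

# "&" stands for the whole matched line, so no capture group is needed;
# "^" and "$" keep >header1 from also deleting >header12.
sed 's#.*#/^&$/,+1d#' "$tmp/secondFile.txt" > "$tmp/commands.sed"

result=$(sed -f "$tmp/commands.sed" "$tmp/firstFile.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```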

score 0 · Answer 6 · answered Apr 11 '19 at 15:43

This awk might work for you:

awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1

mickp

score 0 · Answer 7 · 2019-04-14T15:02:23.993

Try GNU sed:

sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f -  first_file.txt

Prepend the time command to both commands to compare their speed, that is, time while read line;do... and time sed -.... In my test, this finished in less than half the time the OP's loop took.

edited Apr 14 '19 at 15:02

answered Apr 12 '19 at 14:24

score 0 · Answer 8 · answered Nov 22 '22 at 02:35

This can easily be done with bbtools. The seqs2remove.txt file should contain one header per line, written exactly as it appears in the large.fasta file.

filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

timtimbruno
