Printing a sequence from a fasta file

Question

I often need to find a particular sequence in a fasta file and print it. For those who don't know, fasta is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple, you have a line with the sequence name preceded by a '>' and then all the lines following until the next '>' are the sequence itself. For example:

>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG

The way I'm currently getting the sequence I need is to use grep with -A, so I'll do

grep -A 10 sequence_name filename.fa

and then if I don't see the start of the next sequence in the file, I'll change the 10 to 20 and repeat until I'm sure I'm getting the whole sequence.

It seems like there should be a better way to do this. For example, can I ask it to print up until the next '>' character?

score 8 · Accepted Answer · answered Oct 01 '14 at 15:39

8

Using the > as the record separator:

awk -v seq="sequence2" -v RS='>' '$1 == seq {print RS $0}' file

>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG

answered Oct 01 '14 at 15:39

glenn jackman

238,783
38
220
352

+1 Nice. I presume you know you can save yourself the `-v` if you put `RS='>'` AFTER the script but before the file... – Mark Setchell Oct 01 '14 at 15:46
1

I do but I like to keep the variables up front and the files at the end (much like a BEGIN block can appear anywhere in the script, but is usually seen at the beginning). – glenn jackman Oct 01 '14 at 15:47

Mark Setchell · Answer 2 · 2014-10-01T16:03:25.883

Like this maybe:

awk '/>sequence1/{p++;print;next} /^>/{p=0} p' file

So, if the line starts with >sequence1, set a flag (p) to start printing, print this line and move to next. On subsequent lines, if the line starts with >, change p flag to stop printing. In general, print if the flag p is set.

Or, improving a little on your grep solution, use this to cut off the -A (after) context:

grep -A 999999 "sequence1" file | awk 'NR>1 && /^>/{exit} 1'

So, that prints up to 999999 lines after sequence1 and pipes them into awk. Awk then looks for a > at the start of any line after line 1, and exits if it finds one. Until then, the 1 causes awk to do its standard thing, which is print the current line.

score 1 · Answer 3 · answered Oct 01 '14 at 20:47

1

Using sed only:

sed -n '/>sequence3/,/>/ p' | sed '${/>/d}'

answered Oct 01 '14 at 20:47

Vytenis Bivainis

2,308
21
28

score 0 · Answer 4 · answered Oct 01 '14 at 16:48

0

$ perl -0076 -lane 'print join("\n",@F) if $F[0]=~/sequence2/' file

answered Oct 01 '14 at 16:48

dawg

98,345
23
131
206

score 0 · Answer 5 · answered Feb 03 '23 at 23:54

This question has excellent answers already. However, if you are dealing with FASTA records often, I would highly recommend Python's Biopython Module. It has many options and make life easier if you want to manipulate FASTA records. Here is how you can read and print the records:

from Bio import SeqIO
import textwrap

for seq_record in SeqIO.parse("input.fasta", "fasta"):
    print(f'>{seq_record.id}\n{seq_record.seq}')

#If you want to wrap the record into multiline FASTA format
#You can use textwrap module
for seq_record in SeqIO.parse("input.fasta", "fasta"):
    dna_sequence = str(seq_record.seq)
    wrapped_dna_sequence = textwrap.fill(dna_sequence, width=8)
    print(f'>{seq_record.id}\n{wrapped_dna_sequence}')

Printing a sequence from a fasta file

5 Answers5

Linked

Related