Unable to split the sequence from a Fasta File

Question

my fasta file has multiple sequences like the following

>header1
MPANFTE
GSFDSSG
>header2
MDNASFS
EPWPANA

so I am writing a code to remove the headers and the output looks like this in a temporary file:

MPANFTEGSFDSSG

MDNASFSEPWPANA

So far, I have come up with this code: but it is not giving me the exact output.

import sys,subprocess

# open the file
my_file = open("gpcr.fasta")
# read the contents
my_gpcr = my_file.readlines()
for line in my_gpcr:
    if '>' == line[0]:
        header = line
    else:   
        tempf = open('temp.fasta', 'w')
        tempf.write(header)
        tempf.write(line)
        tempf.close()
        print line

Casper S · Answer 1 · 2016-08-08T21:52:35.713

I'm assuming that what you want to do is to print a file containing only one line per sequence? You can use an awk script:

awk 'BEGIN{newline=0}{if(/^>/){if(newline==1){print ""} newline=1}else printf $i}END{print }' gpcr.fasta

If you want to use Python, anyway, this is how I would do it:

import sys, re

# open the file                                                                                            
my_file = open("gpcr.fasta")

newline = 0
for line in my_file:
    line = line.rstrip('\n')
    if re.match('>', line):
        # after we read the first header, we will print a new line when we see one to separate concatenated sequences                                                                                                   
        if newline == 1:
            print
        newline = 1
    else:
        sys.stdout.write(line)

print

Also, a header starting with '>' is not a real FASTA header. Only > should be at the beginning of the line.

Thank you so much, now it's working. And yes it doesn't start with '<' , I was unable to type this in stackoverflow, it was showing something different. — KBnd, Aug 05 '16 at 22:54
No need for `re`, `line.startswith('>')` is totally sufficient. If you rearrange the two `if` blocks in one `if/else` block you can save one `re.match` which makes the code easier to read and faster. — Maximilian Peters, Aug 06 '16 at 12:18

score 0 · Answer 2 · edited May 23 '17 at 11:51

No need for import sys,subprocess in this snippet
You never close my_file, you can avoid this by using with open("gpcr.fasta") as my_file:
Every new sequence line overwrites the previous one since you open temp.fasta with 'w', you could either append the sequences by opening the file with 'a' or using the header as the filename.
You read the whole file at once with readlines(), that's convenient for small files but slow and consumes a lot of memory for larger files (see here for a detailed discussion). It is faster to iterate on the file object.

So here is the suggested code:

my_file = 'gpcr.fasta'
header = ''

with open(my_file, 'r') as my_gpcr:
    for line in my_gpcr:
        if line.startswith('>'):
            # check if we already passed the first line
            if header:
                tempf.close()
                print()
            header = line.strip()
            # open a new file for each sequence
            tempf = open('{}.fasta'.format(header[1:]), 'w')
            # remove if you want to skip the header
            tempf.write(header)
        else:
            # write the sequence line to the file
            # remove the strip() if you want to keep the line breaks
            tempf.write(line.strip())
            # end='\r' makes the sure the sequences are concatenated
            print(line.strip(), end='\r')

Note: This assumes a correctly formatted FASTA file.

Thank you so much, I really appreciate that. – KBnd Aug 08 '16 at 02:45 — KBnd, Aug 08 '16 at 02:45

score 0 · Answer 3 · answered Aug 11 '16 at 05:39

0

Is this all you're trying to do?

$ awk '/>/{if (NR>1) print ""; next} {printf "%s", $0} END{print ""}' file
MPANFTEGSFDSSG
MDNASFSEPWPANA

answered Aug 11 '16 at 05:39

Ed Morton

188,023
17
78
185

Unable to split the sequence from a Fasta File

3 Answers3