0

my fasta file has multiple sequences like the following

>header1
MPANFTE
GSFDSSG
>header2
MDNASFS
EPWPANA

so I am writing a code to remove the headers and the output looks like this in a temporary file:

MPANFTEGSFDSSG

MDNASFSEPWPANA

So far, I have come up with this code: but it is not giving me the exact output.

import sys,subprocess

# open the file
my_file = open("gpcr.fasta")
# read the contents
my_gpcr = my_file.readlines()
for line in my_gpcr:
    if '>' == line[0]:
        header = line
    else:   
        tempf = open('temp.fasta', 'w')
        tempf.write(header)
        tempf.write(line)
        tempf.close()
        print line
Community
  • 1
  • 1
KBnd
  • 21
  • 5

3 Answers3

0

I'm assuming that what you want to do is to print a file containing only one line per sequence? You can use an awk script:

awk 'BEGIN{newline=0}{if(/^>/){if(newline==1){print ""} newline=1}else printf $i}END{print }' gpcr.fasta

If you want to use Python, anyway, this is how I would do it:

import sys, re

# open the file                                                                                            
my_file = open("gpcr.fasta")

newline = 0
for line in my_file:
    line = line.rstrip('\n')
    if re.match('>', line):
        # after we read the first header, we will print a new line when we see one to separate concatenated sequences                                                                                                   
        if newline == 1:
            print
        newline = 1
    else:
        sys.stdout.write(line)

print

Also, a header starting with '>' is not a real FASTA header. Only > should be at the beginning of the line.

Casper S
  • 49
  • 7
  • Thank you so much, now it's working. And yes it doesn't start with '<' , I was unable to type this in stackoverflow, it was showing something different. – KBnd Aug 05 '16 at 22:54
  • No need for `re`, `line.startswith('>')` is totally sufficient. If you rearrange the two `if` blocks in one `if/else` block you can save one `re.match` which makes the code easier to read and faster. – Maximilian Peters Aug 06 '16 at 12:18
0
  • No need for import sys,subprocess in this snippet
  • You never close my_file, you can avoid this by using with open("gpcr.fasta") as my_file:
  • Every new sequence line overwrites the previous one since you open temp.fasta with 'w', you could either append the sequences by opening the file with 'a' or using the header as the filename.
  • You read the whole file at once with readlines(), that's convenient for small files but slow and consumes a lot of memory for larger files (see here for a detailed discussion). It is faster to iterate on the file object.

So here is the suggested code:

my_file = 'gpcr.fasta'
header = ''

with open(my_file, 'r') as my_gpcr:
    for line in my_gpcr:
        if line.startswith('>'):
            # check if we already passed the first line
            if header:
                tempf.close()
                print()
            header = line.strip()
            # open a new file for each sequence
            tempf = open('{}.fasta'.format(header[1:]), 'w')
            # remove if you want to skip the header
            tempf.write(header)
        else:
            # write the sequence line to the file
            # remove the strip() if you want to keep the line breaks
            tempf.write(line.strip())
            # end='\r' makes the sure the sequences are concatenated
            print(line.strip(), end='\r')

Note: This assumes a correctly formatted FASTA file.

Community
  • 1
  • 1
Maximilian Peters
  • 30,348
  • 12
  • 86
  • 99
0

Is this all you're trying to do?

$ awk '/>/{if (NR>1) print ""; next} {printf "%s", $0} END{print ""}' file
MPANFTEGSFDSSG
MDNASFSEPWPANA
Ed Morton
  • 188,023
  • 17
  • 78
  • 185