1

I have to make a generic parser for parsing fasta files using Python.

The format is like:

>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_1, whole genome shotgun sequence  
    TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA

>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_2, whole genome shotgun sequence
CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA
AGTTTATATATAAATTTCCTTTTTATTGGA

>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_3, whole genome shotgun sequence
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA

I have to retrieve each title and sequence separately and insert the values in my created MySQL database.

eg: title1 = PHYSOscaffold_1
    sequence2 = TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA
    title2 = PHYSOscaffold_2
    sequence1 = CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA AGTTTATATATAAATTTCCTTTTTATTGGA

and so on... I the insert these values into a MySQL table.

The output of my parse should be like:

name1 \t sequence1 \t length_of_sequence \t a_count \t t_count \t g_count \t c_count

name2 \t sequence2 \t length_of_sequence \t a_count \t t_count \t g_count \t c_count

So far, I have written a very basic script like this:

infile = open("simple.fasta")
line = infile.readline()
if not line.startswith(">"):
raise TypeError("Not a FASTA file: %r" % line)
title = line
sequence_lines = []
while 1:
  line = infile.readline().rstrip()
  if line == "":
    break
  sequence_lines.append(line)

I am only getting my first sequence and title.

I am a novice and need expert help.

Lev Levitsky
  • 63,701
  • 20
  • 147
  • 175
aki2all
  • 429
  • 1
  • 8
  • 24
  • Is there any specific reason why you don't use the routines in BioPython's [SeqIO](http://biopython.org/wiki/SeqIO#File_Formats)? – chthonicdaemon May 25 '13 at 07:03
  • possible duplicate of [parsing a fasta file using a generator ( python )](http://stackoverflow.com/questions/7654971/parsing-a-fasta-file-using-a-generator-python) – Burhan Khalid May 25 '13 at 07:10
  • I cannot use Biopython module due to strict instructions from my guide. @chthonicdaemon – aki2all May 25 '13 at 08:53

1 Answers1

0

The reason you're only getting the first title and sequence is because the lines between each read are blank. So when you do:

if line == "":
    break

It will break after the first sequence. Using readline() there's no way to detect the end of a file, as it will just return ''.

This is an inelegant solution to the problem:

infile = open("simple.fasta")
# State variable so we can handle the start of the file properly
# There are probably much better ways to do this.
start = True

# Its much better to iterate over the lines than to use a while 1 loop.
for line in infile.readlines():
    if line.startswith(">"):
        if start:
            start = False
        else:
            # Each time we get here we have complete information for a read
            # You can then store that read in your database.
        sequence_lines = []
        title = line
    else:
        if start:
            raise TypeError("Not a FASTA file: %r" % line)
            start = FALSE
        sequence_lines.append(line)
Scott Ritchie
  • 10,293
  • 3
  • 28
  • 64