
I should say up front that I'm as new as it gets to both Python and Biopython. I'm trying to split a large multi-entry .fasta file into separate files, each holding a single entry. I found most of the following code in the Biopython wiki Cookbook and adapted it slightly. My problem is that it names the output files "1.fasta", "2.fasta", etc., and I need them named by an identifier such as the GI number.

def batch_iterator(iterator, batch_size) :
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True #Make sure we loop once
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = next(iterator)
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch

from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
record_iter = SeqIO.parse(open(infile), "fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1)) :
    outfile = "c:\python32\myfiles\%i.fasta" % (i+1)
    handle = open(outfile, "w")
    count = SeqIO.write(batch, handle, "fasta")
    handle.close()

If I try to replace:

outfile = "c:\python32\myfiles\%i.fasta" % (i+1)

with:

outfile = "c:\python32\myfiles\%s.fasta" % (record_iter.id)

so that it will name something similar to seq_record.id in SeqIO, it gives the following error:

Traceback (most recent call last):
  File "C:\Python32\myscripts\generator.py", line 33, in <module>
    outfile = "c:\python32\myfiles\%s.fasta" % (record_iter.id)
AttributeError: 'generator' object has no attribute 'id'

I understand that the generator itself has no attribute 'id', but can I get around this somehow? Is this script too complicated for what I'm trying to do? Thanks, Charles

Seki
user1426421

1 Answer


Because you only want one record at a time, you can ditch the batch_iterator wrapper and the enumeration:

for seq_record in record_iter:

And then what you want is the id property of each record, not the iterator as a whole:

for seq_record in record_iter:
    # Raw string (r"...") so the backslashes in the Windows path are taken literally
    outfile = r"c:\python32\myfiles\{0}.fasta".format(seq_record.id)
    handle = open(outfile, "w")
    count = SeqIO.write(seq_record, handle, "fasta")
    handle.close()
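One caveat when naming files by ID: if your headers use the older NCBI style (for example `gi|12345|ref|NP_000001.1|`), the ID contains `|` characters, which Windows does not allow in file names. Here is a minimal sketch of one way to handle that, assuming that header style. The helper name `id_to_filename` is made up for illustration and is not part of Biopython.

```python
import re

def id_to_filename(record_id, extension=".fasta"):
    """Hypothetical helper: turn a record ID into a safe file name.

    Assumes legacy NCBI-style IDs such as 'gi|12345|ref|NP_000001.1|'.
    If a GI number is present it is used on its own; otherwise any
    character Windows forbids in file names is replaced with '_'.
    """
    match = re.match(r"gi\|(\d+)", record_id)
    if match:
        return match.group(1) + extension
    # Characters not allowed in Windows file names: \ / : * ? " < > |
    safe = re.sub(r'[\\/:*?"<>|]', "_", record_id)
    return safe + extension
```

Inside the loop you would then pass `seq_record.id` through this helper and join the result onto your output folder, instead of formatting the raw ID into the path.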

For your reference, the error occurs because you are asking for the id attribute of record_iter itself. record_iter is not a single record but a Python generator that yields the records one at a time. A generator is a bit like a list-in-progress: it produces each item only when asked, so the whole file never has to be read into memory at once. More on generators: "What can you use Python generator functions for?" and http://docs.python.org/tutorial/classes.html#generators
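To make the distinction concrete, here is a small Biopython-free sketch. The function `parse_ids` is a toy stand-in I made up for illustration (it is not how SeqIO.parse is implemented); it only shows that the generator object has no `id`, while the items it yields carry the data.

```python
def parse_ids(lines):
    """Toy stand-in for a FASTA parser: lazily yields one ID per header line."""
    for line in lines:
        if line.startswith(">"):
            # The ID is the first word after the '>' marker
            yield line[1:].split()[0]

fasta_lines = [">seq1 first entry", "ACGT", ">seq2 second entry", "GGCC"]
record_iter = parse_ids(fasta_lines)

# The generator object itself has no per-record attributes
generator_has_id = hasattr(record_iter, "id")

# Looping over the generator is what hands you the individual items
ids = [record_id for record_id in record_iter]

# A generator can only be consumed once; a second pass yields nothing
leftover = list(record_iter)
```

The last line is worth remembering: once you have looped over record_iter, you need to call the parser again to go through the file a second time.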

Community
Karmel
  • Seems like the best and simplest way. Opening the output file would be cleaner with `with open(outfile, "w") as handle:` – weronika May 30 '12 at 19:08
  • Or instead of doing the open in your code, get Biopython to do it: `count = SeqIO.write(seq_record, outfile, "fasta")` – Peter Cock Aug 29 '12 at 13:26