Biopython Large Sequence splitting

Question

I'm a newbie in the field of python programming. As I was trying to do some analysis,(I've tried to find the answer on other posts, but nothing) I decided to post my first and probably very foolish question. Why does this create only one output file although in this example there were supposed to be at least 8 (sequence is more than 8000 characters). Thank you for your answer upfront.

def batch_iterator(iterator, batch_size) :
    entry = True
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = iterator.next()
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch


from Bio import SeqIO
record_iter = SeqIO.parse(open("some.fasta"),"fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1000)) :   #I think sth is wrong here?
    filename = "group_%i.fasta" % (i+1)
    handle = open(filename, "w")
    count = SeqIO.write(batch, handle, "fasta")
    handle.close()
    print "Wrote %i records to %s" % (count, filename)

What is the output of the `print "Wrote %i records to %s" % (count, filename)` statement? Do you see it only printed once? — logc, Apr 11 '14 at 11:59
And the fact that the whole `batch_iterator` function is not correctly indented is just a bad copy here on Stack Overflow, otherwise it wouldn't even start working ... right? — logc, Apr 11 '14 at 18:25
but what is wrong, Please explain it to me if you think Im capable of understanding, otherwise just pass i guess.. — sdgaw erzswer, Apr 11 '14 at 21:37
Well, the whole `batch_iterator` function should be indented (usually, 4 spaces), instead of being aligned with the `def batch_iterator` line where it is declared. — logc, Apr 11 '14 at 22:48
Correct me If I misunderstood, but this doesn't work beacuse everything from "from Bio import SeqIO" on isnt indented 4 spaces? — sdgaw erzswer, Apr 12 '14 at 09:54
No, it's exactly the opposite. Look, I edited the code to make it right. Now `batch_iterator` is correctly indented as a function. — logc, Apr 12 '14 at 10:08

score 0 · Accepted Answer · edited May 23 '17 at 10:33

Sequence chunks

After a long discussion with the OP, here is my very restructured proposal, using the generator function defined in this other SO thread

# file: main.py
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == '__main__':
    handle = open('long.fasta', 'r')
    records = list(SeqIO.parse(handle, "fasta"))
    record = records[0]

    for pos, chunk in enumerate(chunks(record.seq.tostring(), 1000)):
        chunk_record = SeqRecord(Seq(
            chunk, record.seq.alphabet),
            id=record.id, name=record.name,
            description=record.description)
        outfile = "group_%d.fasta" % pos
        SeqIO.write(chunk_record, open(outfile, 'w'), "fasta")

Note that your original code does something very different: it takes new records from the generator provided by the SeqIO.parse function, and tries to store them in different files. If you want to split a single record in smaller sub-sequences, you have to access the record's internal data, which is done by record.seq.tostring(). The chunks generator function, as described in the other thread linked above, returns as many chunks as is possible to build from the passed in sequence. Each of them is stored as a new fasta record in a different file (if you want to keep just the sequence, write the chunk directly to the opened outfile).

Check that it works

Consider the following code:

# file: generate.py
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio import SeqIO

long_string = "A" * 8000
outfile = open('long.fasta', 'w')

record = SeqRecord(Seq(
    long_string,
    IUPAC.protein),
    id="YP_025292.1", name="HokC",
    description="toxic membrane protein, small")

SeqIO.write(record, outfile, "fasta")

It writes a single record to a file named "long.fasta". This single record has a Sequence inside that is 8000 characters long, as generated in long_string.

How to use it:

$ python generate.py
$ wc -c long.fasta
8177 long.fasta

The overhead over 8000 characters is the file header.

How to split that file in chunks of 1000 length each, with the code snippet above:

$ python main.py
$ ls
generate.py   group_1.fasta group_3.fasta group_5.fasta group_7.fasta main.py
group_0.fasta group_2.fasta group_4.fasta group_6.fasta long.fasta
$ wc -c group_*
1060 group_0.fasta
1060 group_1.fasta
1060 group_2.fasta
1060 group_3.fasta
1060 group_4.fasta
1060 group_5.fasta
1060 group_6.fasta
1060 group_7.fasta
8480 total

I've tried and it doesn't work, the reason why I tried to do things with iteration is for effective memory use, I'm sure there is some very simple solution to that — sdgaw erzswer, Apr 11 '14 at 15:45
But `SeqIO.parse` is [already returning you a SeqRecord iterator](http://biopython.org/wiki/SeqIO#Sequence_Input). The fact that you grow the `batch` list on a separate iterator function or inside the main loop is not going to affect memory usage, AFAICT. — logc, Apr 11 '14 at 18:28
the code should create many files with 1000 units long sequences, instead it jsut duplicattes the whole thing. — sdgaw erzswer, Apr 11 '14 at 21:38
Wrote 1 records to group_1.fasta is the output it gives me although the fasta is way larger than just 1000 characters — sdgaw erzswer, Apr 12 '14 at 10:05
I just checked that it works. Is it possible that I misunderstood your expected output? Do you want to split the records, or the content of a single record? The code you have shown, as far as I can tell, is doing the former. — logc, Apr 12 '14 at 10:58
I am so so sorry. I tried to split the contents of a single file. For example: I've had a 90000 chars long fasta and I tried to split it into equally sized chunks. http://biopython.org/wiki/Split_large_file , this is what I tried but It still doesn't work :/ as you see I've tried to build the code from smaller blocks, but it still doesnt split my fasta into smaller ones. — sdgaw erzswer, Apr 13 '14 at 12:42
@user3523464: I have rearranged completely my answer because now I understand better what you are trying to accomplish. Don't apologize, I am sorry for my part that I did not understand well your question, possibly misleading you! :) — logc, Apr 13 '14 at 16:33
No problem. Could you please accept the answer, if it is indeed what you were looking for? Thanks! — logc, Apr 14 '14 at 12:04

Biopython Large Sequence splitting

1 Answers1

Sequence chunks

Check that it works