Sequence chunks
After a long discussion with the OP, here is my very restructured proposal, using the generator function defined in this other SO thread
# file: main.py
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
def chunks(l, n):
"""Yield successive n-sized chunks from l."""
for i in xrange(0, len(l), n):
yield l[i:i+n]
if __name__ == '__main__':
handle = open('long.fasta', 'r')
records = list(SeqIO.parse(handle, "fasta"))
record = records[0]
for pos, chunk in enumerate(chunks(record.seq.tostring(), 1000)):
chunk_record = SeqRecord(Seq(
chunk, record.seq.alphabet),
id=record.id, name=record.name,
description=record.description)
outfile = "group_%d.fasta" % pos
SeqIO.write(chunk_record, open(outfile, 'w'), "fasta")
Note that your original code does something very different: it takes new records from the generator provided by the SeqIO.parse
function, and tries to store them in different files. If you want to split a single record in smaller sub-sequences, you have to access the record's internal data, which is done by record.seq.tostring()
. The chunks
generator function, as described in the other thread linked above, returns as many chunks as is possible to build from the passed in sequence. Each of them is stored as a new fasta record in a different file (if you want to keep just the sequence, write the chunk
directly to the opened outfile
).
Check that it works
Consider the following code:
# file: generate.py
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio import SeqIO
long_string = "A" * 8000
outfile = open('long.fasta', 'w')
record = SeqRecord(Seq(
long_string,
IUPAC.protein),
id="YP_025292.1", name="HokC",
description="toxic membrane protein, small")
SeqIO.write(record, outfile, "fasta")
It writes a single record to a file named "long.fasta". This single record has a Sequence inside that is 8000 characters long, as generated in long_string
.
How to use it:
$ python generate.py
$ wc -c long.fasta
8177 long.fasta
The overhead over 8000 characters is the file header.
How to split that file in chunks of 1000 length each, with the code snippet above:
$ python main.py
$ ls
generate.py group_1.fasta group_3.fasta group_5.fasta group_7.fasta main.py
group_0.fasta group_2.fasta group_4.fasta group_6.fasta long.fasta
$ wc -c group_*
1060 group_0.fasta
1060 group_1.fasta
1060 group_2.fasta
1060 group_3.fasta
1060 group_4.fasta
1060 group_5.fasta
1060 group_6.fasta
1060 group_7.fasta
8480 total