4

I am using python 2.7 I am working with a fasta file containing DNA sequence of modern human Y chromosome. Actually it is a long string of about 20000000 characters like ATCGACGATCACACG.... I want to convert this very long string to a list of triad strings, for example this string:

My_sequence_string= "ATGTACGTCATAG"

to this list:

My_sequence_list= ["ATG","TAC","GTC","ATA"]

This is my code:

str_Reading_Frame1=open("Ychromosome.fa", "r").read()
list_Reading_Frame1=[]
def str_to_list(list, str):
    if len(str)>2:
        list.append(str[:3])
        str_to_list(list, str[3:])
str_to_list(list_Reading_Frame1, str_Reading_Frame1)

But I see a memory limit error. I think that problem is calling the function inside it, but I don't know how to refine my code. I don't want to import modules, like Biopython, I wanna do it my self ( with your help :-) )

3 Answers3

2

I believe this line

str_Reading_Frame1=open("Ychromosome.fa", "r").read()

is the problem reading a huge string into memeory at once. And the recursion you are doing definitely doesn't help with performance. As well as the stack frames for each recursive call you are slicing a huge string N times which should be O(N^2) performance.

If you read 3 bytes at a time, as long as the list fits in memory, this is the most you can do other than not using a list and just iterating over 3 characters at a time which has also been suggested.

with open('Ychromosome.fa') as f:
    while True:
        triad = f.read(3)
        if len(triad) != 3:
            break
        My_sequence_list.append(triad)


>>> My_sequence_list
['ATG', 'TAC', 'GTC', 'ATA']
jamylak
  • 128,818
  • 30
  • 231
  • 230
2

You could easily use a generator function to avoid loading everything in memory.

def data(x):
    '''x if a file object and data returns an iterable giving blocs of 3 characters'''
    while True:
        d = x.read(3)
        if len(d) != 3:
            raise StopIteration
        yield d

with open("Ychromosome.fa", "r") as str_Reading_Frame1:
    for triad in data(str_Reading_Frame1):
        # use triad one at a time
        ...
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • thank you for your answer, but I understand jamylak's answer more easily than your's. this one is greater than my current python knowledge. Thank you again –  Dec 18 '14 at 15:35
  • 1
    @StackUnderFlow Your are welcome ... If you can load a full list in memory (and with recent systems you should be able to ...) do use jamylak's answer. This one is for even larger input files :-) – Serge Ballesta Dec 18 '14 at 15:38
  • @StackUnderFlow This is an extension of my answer that avoids having to have your entire list stored in memory. It depends whether you need to have it in memory or not, if you dont then use this – jamylak Dec 18 '14 at 23:52
1

I asked this question to write a code about obtaining codon usage of a DNA strand. jamylak answer helped me to refine my code and write my desired code. I write it fully, down here, because I think may be it is useful for some other people.

Bases=["A", "T", "C", "G"] #4 bases of DNA strands
#Generating 64 different codons
codons=[]
def Possible_Codons(Bases):
    for i in Bases:
        for j in Bases:
            for y in Bases:
                ins= "%s%s%s" % (i, j, y)
                codons.append(ins)
Possible_Codons(Bases)

#Generating 6 different reading frames
Code_file=open("3.fa", "r").read()
open("str_Reading_File1.fa", "w").write(Code_file)
open("str_Reading_File2.fa", "w").write(Code_file[1:])
open("str_Reading_File3.fa", "w").write(Code_file[2:])
open("str_Reading_File4.fa", "w").write(Code_file[::-1])
open("str_Reading_File5.fa", "w").write(Code_file[-2::-1])
open("str_Reading_File6.fa", "w").write(Code_file[-3::-1])
My_sequence_list=[]
numbers=["1", "2", "3", "4", "5", "6"] #It is used for calling files
for i in numbers:
    with open("str_Reading_File"+i+".fa") as f:
        while True:
            triad = f.read(3)
            if len(triad) != 3:
                break
            My_sequence_list.append(triad)
    print "In the reading frame "+i+", codon usage is:"
    for i in codons:
        print "%s = %s times" % (i, My_sequence_list.count(i))
    My_sequence_list=[]
    print "*****************\n"