
I'm analyzing an RNA sequence in which I need to read the codons. I first need to split my nucleotide string into a list of triplets, but my function has to take a reading_frame parameter that starts reading the string at index 1, 2, or 3.

I wrote this code and do not know why it doesn't work: I get an incompletely read list for any starting index.

sequence = self.sequence.upper()
split_sequence = []
while len(sequence) >= 3:
    split_sequence.append(sequence[reading_frame:reading_frame + 3])
    reading_frame = reading_frame + 3
    sequence = sequence[reading_frame:]
return split_sequence

I also tried conditionals with regex, but I can't figure out how to write the regex for reading_frame values 1 and 2.

if reading_frame == 0:
    split_sequence = re.findall(r'...', sequence)

if reading_frame == 1:
    split_sequence = re.findall(r'', sequence)

if reading_frame == 2:
    split_sequence = re.findall(r'', sequence)

Any ideas on how to fix these methods, or is there any easier way to do this? Thanks!

dl2257
  • Can you give an example of an input string and how you would want it split up? – user3030010 Dec 06 '16 at 22:59
  • Possible duplicate of [Split python string every nth character?](http://stackoverflow.com/questions/9475241/split-python-string-every-nth-character) – coder Dec 06 '16 at 23:16

4 Answers

Here is a generator with a frame parameter:

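def codons(seq, frame):
    """Yield the whole codons of seq for a 1-based reading frame (1, 2, or 3)."""
    n = len(seq)
    # Stop at n - 2 so that only complete triplets are yielded.
    for i in range(frame - 1, n - 2, 3):
        yield seq[i:i+3]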

For example:

test = 'ACTGCAGCATCAGCCATGCAACT'

for i in range(1,4):
    print(list(codons(test,i)))

Output:

['ACT', 'GCA', 'GCA', 'TCA', 'GCC', 'ATG', 'CAA']
['CTG', 'CAG', 'CAT', 'CAG', 'CCA', 'TGC', 'AAC']
['TGC', 'AGC', 'ATC', 'AGC', 'CAT', 'GCA', 'ACT']

Because it is a generator, you can loop over the codons directly:

>>> for codon in codons(test,1): print(codon)

ACT
GCA
GCA
TCA
GCC
ATG
CAA

Note that the generator always yields whole codons of length 3. If a given reading frame ends with a fragment of length 1 or 2, the generator does not return that fragment. This is by design, though the generator is easily modified to return final fragments if that is what you want.
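One possible way to make that modification is sketched below. It is an illustration, not part of the original answer: the name codons_with_tail and the keep_partial flag are hypothetical. The only change is the upper bound of the range, which decides whether a trailing fragment of one or two nucleotides is yielded.

```python
def codons_with_tail(seq, frame, keep_partial=True):
    """Yield codons of seq for a 1-based reading frame (1, 2, or 3).

    keep_partial is a hypothetical flag: when True, a trailing fragment
    of length 1 or 2 is yielded as the last item; when False, only
    whole codons are yielded.
    """
    # len(seq) lets the last slice be short; len(seq) - 2 stops before it.
    stop = len(seq) if keep_partial else len(seq) - 2
    for i in range(frame - 1, stop, 3):
        yield seq[i:i + 3]
```

With the test string above, frame 1 ends with the fragment 'CT' and frame 2 with 'T', while frame 3 divides evenly and yields no fragment.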

John Coleman

This doesn't fix your code, but you could also use a list comprehension, as in this answer:

sequence = "ABCBACCABBAC"
n = 3
starting_point = 2  # 0-based index of the first character to read
[sequence[i:i+n] for i in range(starting_point, len(sequence), n)]
# ['CBA', 'CCA', 'BBA', 'C']

starting_point = 0
[sequence[i:i+n] for i in range(starting_point, len(sequence), n)]
# ['ABC', 'BAC', 'CAB', 'BAC']
p-robot

What you want to do is read an iterable in chunks, as described here. In Python 2, you can use this utility function (in Python 3, replace xrange with range):

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i + n]

This function returns a generator instead of a list. Use it like this:

s = 'ATCATGATTTATAGGHCFFDD'
codons = list(chunks(s, 3))  # ['ATC', 'ATG', 'ATT', 'TAT', 'AGG', 'HCF', 'FDD']
bananafish
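
# Drop the first reading_frame characters (0-based offset), then slice into triplets.
# The last element may be a 1- or 2-character fragment.
sequence = sequence[reading_frame:]
split_sequence = [sequence[i:i+3] for i in range(0, len(sequence), 3)]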
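
The same slicing idea also covers the regex part of the question. The loop in the question appears to lose codons because it advances reading_frame and also truncates sequence on every pass, so the read position moves twice. Slicing once up front avoids that, and a single pattern then works for every frame. This is a minimal sketch, assuming reading_frame is a 0-based offset (0, 1, or 2); the function name split_codons_regex is hypothetical.

```python
import re

def split_codons_regex(sequence, reading_frame):
    """Return the whole codons of sequence, skipping reading_frame leading characters.

    reading_frame is assumed to be a 0-based offset (0, 1, or 2).
    """
    # '...' matches any three characters except newline. re.findall returns
    # non-overlapping matches, so a trailing 1- or 2-character fragment is dropped.
    return re.findall(r'...', sequence.upper()[reading_frame:])
```

Unlike the list comprehension, this version returns only complete triplets.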
Thmei Esi