How to split up a string every three indices, starting at index 0, 1, or 2?

Question

I'm analyzing an RNA sequence in which I need to read the codons. I first need to split my nucleotide string into a list of three-character groups (triplets), and my function has to take a reading_frame parameter that starts reading the string at index 1, 2, or 3.

I wrote this code and do not know why it does not work: whatever index I pass, I get an incompletely read list.

sequence = self.sequence.upper()
split_sequence = []
while len(sequence) >= 3:
    split_sequence.append(sequence[reading_frame:reading_frame + 3])
    reading_frame = reading_frame + 3
    sequence = sequence[reading_frame:]
return split_sequence

I also tried conditionals and regex, but I can't figure out how to write the regex for reading_frame 1 and 2.

if reading_frame == 0:
    split_sequence = re.findall(r'...', sequence)

if reading_frame == 1:
    split_sequence = re.findall(r'', sequence)

if reading_frame == 2:
    split_sequence = re.findall(r'', sequence)

Any ideas on how to fix these methods, or is there any easier way to do this? Thanks!

Can you give an example of an input string and how you would want it split up? — user3030010, Dec 06 '16 at 22:59
Possible duplicate of [Split python string every nth character?](http://stackoverflow.com/questions/9475241/split-python-string-every-nth-character) — coder, Dec 06 '16 at 23:16

John Coleman · Answer 1 · 2016-12-06T23:24:11.783

Here is a generator with a frame parameter:

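def codons(seq, frame):
    """Yield successive codons of seq for a 1-based reading frame (1, 2, or 3)."""
    n = len(seq)
    # Stop at n - 2 so that only complete triplets are yielded.
    for i in range(frame - 1, n - 2, 3):
        yield seq[i:i+3]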

For example:

test = 'ACTGCAGCATCAGCCATGCAACT'

for i in range(1,4):
    print(list(codons(test,i)))

Output:

['ACT', 'GCA', 'GCA', 'TCA', 'GCC', 'ATG', 'CAA']
['CTG', 'CAG', 'CAT', 'CAG', 'CCA', 'TGC', 'AAC']
['TGC', 'AGC', 'ATC', 'AGC', 'CAT', 'GCA', 'ACT']

As a generator, you can loop through codons as follows:

>>> for codon in codons(test,1): print(codon)

ACT
GCA
GCA
TCA
GCC
ATG
CAA

Note that the generator always yields whole codons of length 3. If a given reading frame ends with a fragment of length 1 or 2 it isn't returned by the generator. That behavior is by design, though it is easily modified to return final fragments if that is what you want.
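One way to make that modification, as a minimal sketch: the name codons_with_tail is hypothetical, and it keeps the 1-based frame convention used above. The only change is letting the range run to the end of the string, so a final slice of length 1 or 2 is yielded as well.

```python
def codons_with_tail(seq, frame):
    """Yield codons for a 1-based reading frame, including a short final fragment."""
    # Running the range to len(seq) instead of len(seq) - 2 lets the last
    # slice be shorter than 3 characters.
    for i in range(frame - 1, len(seq), 3):
        yield seq[i:i + 3]

test = 'ACTGCAGCATCAGCCATGCAACT'
print(list(codons_with_tail(test, 1)))
# ['ACT', 'GCA', 'GCA', 'TCA', 'GCC', 'ATG', 'CAA', 'CT']
```

For this test string, frame 3 ends exactly on a codon boundary, so its output is the same as with the original generator.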

score 0 · Answer 2 · edited May 23 '17 at 11:45

Sorry not to fix your code, but you could also use a list comprehension, as in this answer:

sequence = "ABCBACCABBAC"
n = 3
starting_point = 2
[sequence[i:i+n] for i in range(starting_point, len(sequence), n)]
# ['CBA', 'CCA', 'BBA', 'C']

starting_point = 0
[sequence[i:i+n] for i in range(starting_point, len(sequence), n)]
# ['ABC', 'BAC', 'CAB', 'BAC']
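Note the trailing 'C' in the first result: this comprehension keeps a leftover piece shorter than n. If only complete triplets are wanted, which is probably the case for codons, one option is to stop the range early. This is a sketch reusing the same variable names; the name complete is introduced here for illustration only.

```python
sequence = "ABCBACCABBAC"
n = 3
starting_point = 2

# Stopping the range at len(sequence) - n + 1 guarantees that every slice
# has exactly n characters, so the leftover 'C' is no longer produced.
complete = [sequence[i:i + n]
            for i in range(starting_point, len(sequence) - n + 1, n)]
print(complete)  # ['CBA', 'CCA', 'BBA']
```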

answered Dec 06 '16 at 23:13

p-robot

score 0 · Answer 3 · edited May 23 '17 at 12:08

What you want to do is read an iterable in chunks, as described here. If you are using Python 2, use this utility function (in Python 3, xrange no longer exists, so replace it with range):

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i + n]

This function returns a generator instead of a list. Use it like this:

s = 'ATCATGATTTATAGGHCFFDD'
codons = list(chunks(s, 3))  # ['ATC', 'ATG', 'ATT', 'TAT', 'AGG', 'HCF', 'FDD']
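The function above always starts at index 0, so it does not yet cover the reading frame from the question. A possible Python 3 variant is sketched below; the name chunks_from and the start parameter are assumptions added for illustration and are not part of the original utility.

```python
def chunks_from(l, n, start=0):
    """Yield successive n-sized chunks from l, beginning at index start."""
    # range replaces the Python 2 xrange; start shifts the first chunk.
    for i in range(start, len(l), n):
        yield l[i:i + n]

s = 'ATCATGATTTATAGGHCFFDD'
print(list(chunks_from(s, 3, start=1)))
# ['TCA', 'TGA', 'TTT', 'ATA', 'GGH', 'CFF', 'DD']
```

As with the original, the last chunk may be shorter than n when the remaining length is not a multiple of n.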

Thmei Esi · Accepted Answer · 2016-12-06T23:28:11.370

sequence = sequence[reading_frame:]
split_sequence = [sequence[i:i+3] for i in range(0, len(sequence), 3)]
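The same slice-first idea also covers the regex attempt in the question: once reading_frame characters are trimmed from the front, a single pattern works for every frame. The sketch below assumes a 0-based reading_frame, and the function name split_codons is hypothetical. Unlike the list comprehension above, re.findall with r'...' silently drops a trailing piece shorter than three characters.

```python
import re

def split_codons(sequence, reading_frame):
    """Split sequence into triplets, starting at the 0-based index reading_frame."""
    # '.' matches any character except a newline, so '...' picks up consecutive,
    # non-overlapping runs of three characters; a shorter remainder is ignored.
    return re.findall(r'...', sequence.upper()[reading_frame:])

print(split_codons('actgcagcatcagccatgcaact', 1))
# ['CTG', 'CAG', 'CAT', 'CAG', 'CCA', 'TGC', 'AAC']
```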

edited Dec 06 '16 at 23:28

answered Dec 06 '16 at 23:22
