I want to write a function that takes a long string of characters (an RNA sequence coding for a protein, like 'UGGUGUUAUUAAUGGUUU') and extracts three characters at a time from it (i.e. the codons). It can either return each set of three characters one after another, or return a list containing all the sets of three characters; either way would work. But I'm having some trouble figuring out how to do this cleanly.

Here's what I have so far:

def get_codon_list(codon_string):
    codon_start = 0
    codon_length = 3
    codon_end = 3
    codon_list = []
    for x in range(len(codon_string) // codon_length):
        codon_list.append(codon_string[codon_start:codon_end])
        codon_start += codon_length
        codon_end += codon_length
    return codon_list

It works to return a list of the codons, but it seems very inefficient. I don't like using hard-coded numbers and incrementing variables like that if there is a better way. I also don't like using for loops that don't actually use the variable in the loop. It doesn't seem like a proper use of it.

Any suggestions for how to improve this, either with a specific function/module, or just a better Pythonic technique?

Thanks!

John Salerno
  • See https://stackoverflow.com/questions/5389507/iterating-over-every-two-elements-in-a-list – jonrsharpe Feb 15 '20 at 23:41
  • Related is this link: https://stackoverflow.com/questions/22571259/split-a-string-into-n-equal-parts/22571377 This method uses text wrapping. – Robert McClardy Feb 15 '20 at 23:45
  • Does this answer your question? [Split string every nth character?](https://stackoverflow.com/questions/9475241/split-string-every-nth-character) – AMC Feb 16 '20 at 00:17

5 Answers

4

You can use a list comprehension that takes a slice of length 3 from the string at each step.

>>> s="UGGUGUUAUUAAUGGUUU"
>>> res = [s[i:i+3] for i in range(0,len(s),3)]
>>> res
['UGG', 'UGU', 'UAU', 'UAA', 'UGG', 'UUU']
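The question also mentions returning the codons one after another rather than as a list. A minimal sketch of that variant, using the same slicing idea inside a generator, might look like the following. The name `iter_codons` and the `codon_length` parameter are hypothetical, chosen only for illustration.

```python
def iter_codons(sequence, codon_length=3):
    """Yield successive slices of codon_length characters from sequence."""
    # range() with a step visits the start index of each codon: 0, 3, 6, ...
    for start in range(0, len(sequence), codon_length):
        yield sequence[start:start + codon_length]


# The generator can be consumed lazily in a loop, or collected into a list.
for codon in iter_codons("UGGUGUUAUUAAUGGUUU"):
    print(codon)
```

Note that, like the list comprehension, this yields a shorter final chunk when the length of the string is not a multiple of 3.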
abc
3

You can simply use the step argument of the range function to avoid maintaining the variables:

def get_codon_list(codon_string):
    codon_length = 3
    codon_list = []

    for codon_start in range(0, len(codon_string), codon_length):
        codon_end = codon_start + codon_length
        codon_list.append(codon_string[codon_start:codon_end])

    return codon_list

This can then be condensed into a list comprehension:

def get_codon_list(codon_string):
    codon_length = 3

    codon_list = [codon_string[x:x+codon_length] for x in range(0, len(codon_string), codon_length)]

    return codon_list
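One behavioral difference from the code in the question is worth knowing: the original loop runs `len(codon_string) // codon_length` times and so silently drops an incomplete trailing codon, whereas stepping with `range` up to `len(codon_string)` keeps it as a shorter final slice. If that matters, a sketch along these lines makes the choice explicit. The `keep_partial` flag is a hypothetical parameter added here for illustration.

```python
def get_codon_list(codon_string, codon_length=3, keep_partial=False):
    """Split codon_string into codons, optionally keeping a short trailing chunk."""
    usable = len(codon_string)
    if not keep_partial:
        # Trim the end index back to the last complete codon boundary.
        usable -= usable % codon_length
    return [codon_string[i:i + codon_length]
            for i in range(0, usable, codon_length)]


print(get_codon_list("UGGUGUUA"))                     # ['UGG', 'UGU']
print(get_codon_list("UGGUGUUA", keep_partial=True))  # ['UGG', 'UGU', 'UA']
```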
Tomerikoo
2

The itertools grouper recipe is perfect for that (https://docs.python.org/3/library/itertools.html#itertools-recipes):

In [1]: from itertools import zip_longest

In [2]: def grouper(iterable, n, fillvalue=None):
   ...:     "Collect data into fixed-length chunks or blocks"
   ...:     # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
   ...:     args = [iter(iterable)] * n
   ...:     return zip_longest(*args, fillvalue=fillvalue)
   ...:

In [3]: list(grouper('UGGUGUUAUUAAUGGUUU', 3))
Out[3]:
[('U', 'G', 'G'),
 ('U', 'G', 'U'),
 ('U', 'A', 'U'),
 ('U', 'A', 'A'),
 ('U', 'G', 'G'),
 ('U', 'U', 'U')]
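The recipe returns tuples of single characters rather than strings. If strings are wanted, as in the question, the tuples can be joined back together. The sketch below assumes an empty string as the fill value so that a short final group still joins cleanly (the default `None` would make `"".join` fail); the helper name `codon_strings` is hypothetical.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # The same iterator repeated n times, so zip_longest pulls n items per tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def codon_strings(sequence, codon_length=3):
    """Join each tuple from grouper back into a string."""
    return ["".join(group)
            for group in grouper(sequence, codon_length, fillvalue="")]


print(codon_strings("UGGUGUUAUUAAUGGUUU"))
# ['UGG', 'UGU', 'UAU', 'UAA', 'UGG', 'UUU']
```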
Randy
0

You might want to use a while loop here: increment the index by 3 on each iteration, print the next three letters, and exit once fewer than 3 characters remain.
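A rough sketch of that idea follows. It collects the codons into a list instead of printing them, to match what the question asks for, and it assumes the loop should stop as soon as a full codon no longer fits.

```python
def get_codon_list(codon_string, codon_length=3):
    """Collect codons with a while loop, stopping when no full codon remains."""
    codon_list = []
    index = 0
    # Continue only while a complete codon fits between index and the end.
    while index + codon_length <= len(codon_string):
        codon_list.append(codon_string[index:index + codon_length])
        index += codon_length
    return codon_list


print(get_codon_list("UGGUGUUAUUAAUGGUUU"))
# ['UGG', 'UGU', 'UAU', 'UAA', 'UGG', 'UUU']
```

This still maintains an index by hand, which is what the question wanted to avoid, so stepping with `range` is probably the tidier option.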

C-RAD
0

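With a regular expression:

import re

def get_codon_list(codon_string):
    # re.findall already returns a list, so no list() wrapper is needed.
    # \w{3} matches three word characters at a time; an incomplete
    # trailing codon is not matched and is therefore dropped.
    return re.findall(r"\w{3}", codon_string)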
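One caveat with the regex approach: `\w` matches any letter, digit, or underscore, and `findall` silently skips over anything that does not match, which can shift the reading frame without warning. A hedged sketch of a stricter variant is below; it assumes the input should contain only the RNA letters A, C, G and U, and the `CODON_PATTERN` name is hypothetical.

```python
import re

# Assumption: a valid sequence uses only the four RNA nucleotide letters.
CODON_PATTERN = re.compile(r"[ACGU]{3}")


def get_codon_list(codon_string):
    """Split an RNA string into codons, rejecting unexpected characters."""
    if re.fullmatch(r"[ACGU]*", codon_string) is None:
        raise ValueError("sequence contains characters other than A, C, G, U")
    # An incomplete trailing codon is not matched, so it is dropped.
    return CODON_PATTERN.findall(codon_string)


print(get_codon_list("UGGUGUUAUUAAUGGUUU"))
# ['UGG', 'UGU', 'UAU', 'UAA', 'UGG', 'UUU']
```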
jsuisnul