Bioinformatics: matching list items with dictionary keys and printing the matching keys

Question

Homework assistance

I need to write a function which has the ability to take in a string containing DNA codons from a user e.g.

'ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAAC'

Separate the string into groups of 3, then match each group against the dictionary items. The program must print only the keys, not the values.

input: ATTGHIATGTTTTTCTYU

separation:[ATT] [GHI] [ATG] [TTT] [TTC] [TYU]

output: IMFF

This is what I have so far

dna_codons = {'I': 'ATT' 'ATC' 'ATA',
              'L': 'CTT' 'CTC' 'CTA' 'CTG' 'TTA' 'TTG',
              'V': 'GTT' 'GTC' 'GTA' 'GTG',
              'F': 'TTT' 'TTC',
              'M': 'ATG',
              }
def translate(sequence):
    n = 3
    MyList = [sequence[i:i+n] for i in range(0, len(sequence), n)]
    for codon in MyList:
        for slc in dna_codons.keys():
            if codon == slc:
                print slc

print translate(raw_input('type in DNA sequence: '))

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

You can achieve this more easily with a list comprehension and a generator that splits the input string into chunks.

Try something like this:

in_seq = 'ATTGHIATGTTTTTCTYU'  # change this to input()

_codes = {  # your original dict is incorrect
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L', 'TTA': 'L', 'TTG': 'L',
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
    'TTT': 'F', 'TTC': 'F',
    'ATG': 'M',
}


def split_seq(s, n=2):
    """ split string to chunks of size n """
    i = 0
    while i < len(s):
        yield s[i:i + n]
        i += n

out_codes = [_codes[z.upper()] for z in split_seq(in_seq, 3) if z.upper() in _codes]
result = ''.join(out_codes)
print(result)

Output:

IMFF

If you want to see the separated list, use print(list(split_seq(in_seq, 3))):

['ATT', 'GHI', 'ATG', 'TTT', 'TTC', 'TYU']

update

If you don't want to use a generator, replace it with this ordinary function:

def split_seq(s, n=2):
    res = []
    i = 0
    while i < len(s):
        res.append(s[i:i + n])
        i += n
    return res
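
As a rough illustration of how the two versions differ (Python 3; the names `split_seq_gen` and `split_seq_list` are invented here only to keep both side by side): the generator produces chunks one at a time, on demand, while the list version builds them all up front. Both give the same chunks.

```python
def split_seq_gen(s, n=3):
    """Generator version: yields one chunk at a time."""
    i = 0
    while i < len(s):
        yield s[i:i + n]
        i += n


def split_seq_list(s, n=3):
    """List version: builds and returns all chunks at once."""
    return [s[i:i + n] for i in range(0, len(s), n)]


chunks = split_seq_gen('ATTGHIATG')
first = next(chunks)   # 'ATT': computed only when requested
rest = list(chunks)    # ['GHI', 'ATG']: the chunks not yet consumed
same = list(split_seq_gen('ATTGHIATG')) == split_seq_list('ATTGHIATG')
print(first, rest, same)
```

For short DNA strings the difference hardly matters; the generator mainly pays off when the input is large and you only need to walk over it once.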

Thanks mate. I haven't learned about 'yield' as yet. That's new to me. What does that do? — Zee Dhlomo, May 24 '18 at 17:01
You're welcome:) `yield` is somewhat similar to `return`. It is used in special functions called generators. You can learn more about it here - https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do. — Ivan Vinogradov, May 24 '18 at 17:03
Updated the answer. You can replace the generator with an ordinary function which simply returns a `list` representing the split input string. — Ivan Vinogradov, May 24 '18 at 17:09

score 0 · Answer 2 · answered May 24 '18 at 16:29

The main issue with your code is that 'I': 'ATT' 'ATC' 'ATA' won't work. Adjacent string literals just get concatenated together (ATTATCATA). You need to turn those strings into lists: 'I': ['ATT', 'ATC', 'ATA']. You can then use nested loops to iterate over the dictionary and the lists:

for slc in dna_codons.keys():
    for item in dna_codons[slc]:
        if codon == item:
            print slc
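
A quick way to see the concatenation problem for yourself (a small Python 3 sketch; the names `broken` and `fixed` are made up for the demonstration):

```python
# Adjacent string literals are joined into a single string.
broken = {'I': 'ATT' 'ATC' 'ATA'}
print(broken['I'])            # ATTATCATA

# With a list, each codon stays a separate item.
fixed = {'I': ['ATT', 'ATC', 'ATA']}
print('ATC' in fixed['I'])    # True
print('ATC' == broken['I'])   # False: the value is one 9-character string
```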

Finally, the print statement at the end will always print None because your function doesn't return anything to print. Ideally your function should return the desired output rather than printing it as a side effect:

aa_seq = ''
for codon in MyList:
    for slc in dna_codons.keys():
        for item in dna_codons[slc]:
            if codon == item:
                aa_seq += slc
return aa_seq

Of course, you're not getting much benefit from using a dictionary if you have to loop over all the values for every codon. It would be a lot more efficient to make the codons the keys and the amino acids the values. That way you could just use:

aa_seq = ''
for codon in MyList:
    aa_seq += dna_codons.get(codon, '')  # .get() skips codons missing from the table
return aa_seq
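
Putting those pieces together, a minimal self-contained sketch might look like the following. It assumes Python 3, a codon-keyed dictionary (called `codon_to_aa` here purely for illustration), and that unrecognised codons should be skipped, as in the question's example.

```python
# Hypothetical codon-keyed table: codons are the keys, amino acids the values.
codon_to_aa = {
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L', 'TTA': 'L', 'TTG': 'L',
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
    'TTT': 'F', 'TTC': 'F',
    'ATG': 'M',
}


def translate(sequence, n=3):
    """Return the amino-acid letters for the recognised codons in sequence."""
    codons = [sequence[i:i + n] for i in range(0, len(sequence), n)]
    aa_seq = ''
    for codon in codons:
        # .get() returns '' for codons missing from the table (e.g. 'GHI').
        aa_seq += codon_to_aa.get(codon, '')
    return aa_seq


print(translate('ATTGHIATGTTTTTCTYU'))  # IMFF
```

Each codon now costs a single dictionary lookup instead of a scan over every value.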

Faboor · Answer 3 · 2018-05-24T16:57:44.590

First, you have to change your dna_codons so that each value is a list or a tuple. Currently, the codon strings just get concatenated into a single string.

dna_codons = {
    'I': ['ATT', 'ATC', 'ATA'],
    'L': ['CTT', 'CTC', 'CTA', 'CTG', 'TTA', 'TTG'],
    'V': ['GTT', 'GTC', 'GTA', 'GTG'],
    'F': ['TTT', 'TTC'],
    'M': ['ATG'],
}

Now you can use @heathobrien's nested loops, but those are quite inefficient. I think you should change the dictionary so that it maps from codon to amino acid. You can do it with:

def transpose(d):
    out = {}
    for key, values in d.items():
        for val in values:
            out[val] = key
    return out

codon_to_aa = transpose(dna_codons)

This produces a dictionary {'ATG': 'M', 'ATT': 'I', 'ATC': 'I', ...}. After that, the rest is pretty straightforward. You just need to split the sequence and look up each codon in the mapping. Reusing your code:

def translate(sequence):
    n = 3
    codons = (sequence[i:i+n] for i in range(0, len(sequence), n))
    for codon in codons:
        print codon_to_aa.get(codon, ''),
    print

translate(raw_input('type in DNA sequence: '))

The comma after the first print makes sure that the next character gets printed out on the same line. The empty print will end the line. However, I'd suggest you aggregate the output into a variable and print it all at once.

Alternatively:

def translate(sequence):
    n = 3
    return ''.join(codon_to_aa.get(codon, '') for codon in   
                      (sequence[i:i + n] for i in xrange(0, len(sequence), n)))

print translate(raw_input('type in DNA sequence: '))
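
For readers on Python 3, where `print` is a function and `xrange` and `raw_input` no longer exist, the same idea can be sketched as below. The dictionary comprehension is only a compact alternative to `transpose()`, and the hard-coded input stands in for `input()` so the example runs unattended.

```python
dna_codons = {
    'I': ['ATT', 'ATC', 'ATA'],
    'L': ['CTT', 'CTC', 'CTA', 'CTG', 'TTA', 'TTG'],
    'V': ['GTT', 'GTC', 'GTA', 'GTG'],
    'F': ['TTT', 'TTC'],
    'M': ['ATG'],
}

# Invert the mapping: each codon becomes a key pointing at its amino acid.
codon_to_aa = {codon: aa for aa, codons in dna_codons.items() for codon in codons}


def translate(sequence, n=3):
    """Join the amino-acid letters of every recognised codon."""
    chunks = (sequence[i:i + n] for i in range(0, len(sequence), n))
    return ''.join(codon_to_aa.get(chunk, '') for chunk in chunks)


print(translate('ATTGHIATGTTTTTCTYU'))  # IMFF
```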

You have a typo in `def transponse(d):` — Ivan Vinogradov, May 24 '18 at 16:45

martineau · Answer 4 · 2018-05-24T17:17:14.720

Here's a way to do it using the itertools recipe for a generator function named grouper().

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ...
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

DNA_CODONS = {
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L', 'TTA': 'L', 'TTG': 'L',
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
    'TTT': 'F', 'TTC': 'F',
    'ATG': 'M',
}

def translate(sequence, n=3):
    # Join each tuple of nucleotides back into a codon string first,
    # then keep only the codons present in the table.
    codons = (''.join(nt) for nt in grouper(sequence, n, ' '))
    return [codon for codon in codons if codon in DNA_CODONS]

input_sequence = 'ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAAC'
print(translate(input_sequence))  # -> ['TTT', 'GTG', 'TTC', 'CTC']
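
Since the question asks for the amino-acid letters rather than the codons themselves, one possible variation is to look each matched codon up in the table. This is a sketch, not part of the itertools recipe, and `translate_to_amino_acids` is a name invented here. Note that `grouper` pads the final incomplete chunk with the fill value, so a trailing 'C' becomes 'C  ' and is simply not found in the table.

```python
from itertools import zip_longest

DNA_CODONS = {
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L', 'TTA': 'L', 'TTG': 'L',
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
    'TTT': 'F', 'TTC': 'F',
    'ATG': 'M',
}


def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def translate_to_amino_acids(sequence, n=3):
    """Hypothetical helper: map each recognised codon to its amino acid."""
    codons = (''.join(nt) for nt in grouper(sequence, n, ' '))
    return ''.join(DNA_CODONS[codon] for codon in codons if codon in DNA_CODONS)


print(list(grouper('ABCDEFG', 3, 'x'))[-1])  # ('G', 'x', 'x')
print(translate_to_amino_acids(
    'ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAAC'))  # FVFL
```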
