0

I have many rna sequences of the same length. Now i want to create a function that will give me one line of ambiguous rna as output. So far i'm not finding any useful information on writing ambiguous sequences online.

I thought about using a dictionary like this:

d = {"N": ["A", "C", "G", "U"],
 "R": ["A", "G"],
 "Y": ["U", "C"],
 "K": ["G", "U"],
 "M": ["A", "C"],
 "B": ["C", "G", "U"],
 "D": ["A", "G", "U"],
 "H": ["A", "C", "U"],
 "V": ["A", "C", "G"]}

i have no idea on how i could put it to use in the right way since i'm a beginner.

 test = ['GUUUUUCAUUUA', 'GUUUUUCAUUUG', 'GUUUUUCAUCUU', 'GUUUUUCAUCUC', 
'GUUUUUCAUCUA', 'GUUUUUCAUCUG', 'GUUUUUCACUUA', 'GUUUUUCACUUG', 
'GUUUUUCACCUU', 'GUUUUUCACCUC', 'GUUUUUCACCUA', 'GUUUUUCACCUG', 
'GUUUUCCAUUUA', 'GUUUUCCAUUUG', 'GUUUUCCAUCUU', 'GUUUUCCAUCUC', 
'GUUUUCCAUCUA', 'GUUUUCCAUCUG', 'GUUUUCCACUUA', 'GUUUUCCACUUG', 
'GUUUUCCACCUU', 'GUUUUCCACCUC', 'GUUUUCCACCUA', 'GUUUUCCACCUG', 
'GUCUUUCAUUUA', 'GUCUUUCAUUUG', 'GUCUUUCAUCUU', 'GUCUUUCAUCUC', 
'GUCUUUCAUCUA', 'GUCUUUCAUCUG', 'GUCUUUCACUUA', 'GUCUUUCACUUG', 
'GUCUUUCACCUU', 'GUCUUUCACCUC', 'GUCUUUCACCUA', 'GUCUUUCACCUG', 
'GUCUUCCAUUUA', 'GUCUUCCAUUUG', 'GUCUUCCAUCUU', 'GUCUUCCAUCUC', 
'GUCUUCCAUCUA', 'GUCUUCCAUCUG', 'GUCUUCCACUUA', 'GUCUUCCACUUG', 
'GUCUUCCACCUU', 'GUCUUCCACCUC', 'GUCUUCCACCUA', 'GUCUUCCACCUG', 
'GUAUUUCAUUUA', 'GUAUUUCAUUUG']
  • 5
    Welcome to Stack Overflow! See: [How do I do X?](https://meta.stackoverflow.com/questions/253069/whats-the-appropriate-new-current-close-reason-for-how-do-i-do-x) The expectation on Stack Overflow is that the user asking a question not only does research to answer their own question but also shares that research, code attempts, and results. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and most of all it helps you get a more specific and relevant answer! See also: [How to Ask](https://stackoverflow.com/help/how-to-ask) – undetected Selenium Dec 24 '18 at 14:38

1 Answers1

0

First you want to see which nucleotides display ambiguity. The first nucleotide of all your test sequences is a G, so no ambiguity there. But the third nucleotide of all your test sequences can be either an A, C or U.

We can see this programatically by doing map(set, zip(*test)) (making use of the unpacking operator *):

>>> list(map(set, zip(*test)))
[{'G'},
 {'U'},
 {'A', 'C', 'U'},
 {'U'},
 {'U'},
 {'C', 'U'},
 {'C'},
 {'A'},
 {'C', 'U'},
 {'C', 'U'},
 {'U'},
 {'A', 'C', 'G', 'U'}]

OK, so for the third nucleotide, where we can have either A, C or U, we want to replace it with H. But in your dictionary, the H is the key and the ['A', 'C', 'U'] are the values. So we have to invert the dictionary. Because we can't use lists as key, we turn them into strings. And we also sort them so that we always have the same order.

>>> d = {''.join(sorted(v)): k  for (k, v) in d.items()}
>>> d
{'ACGU': 'N',
 'AG': 'R',
 'CU': 'Y',
 'GU': 'K',
 'AC': 'M',
 'CGU': 'B',
 'AGU': 'D',
 'ACU': 'H',
 'ACG': 'V'}

and now we can write your function:

>>> def give_ambiguous_rna(sequences):
        return ''.join([list(s)[0] if len(s) == 1 else d[''.join(sorted(s))] for s in map(set, zip(*sequences))])
>>> give_ambiguous_rna(test)
'GUHUUYCAYYUN'
BioGeek
  • 21,897
  • 23
  • 83
  • 145