How to extend an ambiguous DNA sequence

Question

Let's say you have a DNA sequence like this:

AATCRVTAA

where R and V are ambiguous IUPAC nucleotide codes: R represents either A or G, and V represents A, C or G.

Is there a Biopython method to generate all the different sequences that the above ambiguous sequence could represent?

Here, for instance, the output would be:

AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA

Jivan · Accepted Answer · 2014-12-18T17:58:46.567

Perhaps a slightly shorter and faster way, since this function will most likely be used on very large data:

from Bio import Seq
from itertools import product

def extend_ambiguous_dna(seq):
    """return list of all possible sequences given an ambiguous DNA input"""
    d = Seq.IUPAC.IUPACData.ambiguous_dna_values
    # product() yields one tuple per combination; "".join turns each into a string
    return list(map("".join, product(*map(d.get, seq))))

Using map moves the per-item loop into C instead of Python bytecode. This is usually somewhat faster than a plain loop or a list comprehension, although the gain is modest, as the test below shows.

Field testing

With a simple dict as d instead of the ambiguous_dna_values table:

from itertools import product
import time

d = { "N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"] }
seq = "RNRN"

# using list comprehensions
lst_start = time.time()
[ "".join(i) for i in product(*[ d[j] for j in seq ]) ]
lst_end = time.time()

# using map
map_start = time.time()
list(map("".join, product(*map(d.get, seq))))
map_end = time.time()

lst_delay = (lst_end - lst_start) * 1000
map_delay = (map_end - map_start) * 1000

print("List delay: {} ms".format(round(lst_delay, 2)))
print("Map delay: {} ms".format(round(map_delay, 2)))

Outputs:

# len(seq) = 2:
List delay: 0.02 ms
Map delay: 0.01 ms

# len(seq) = 3:
List delay: 0.04 ms
Map delay: 0.02 ms

# len(seq) = 4
List delay: 0.08 ms
Map delay: 0.06 ms

# len(seq) = 5
List delay: 0.43 ms
Map delay: 0.17 ms

# len(seq) = 10
List delay: 126.68 ms
Map delay: 77.15 ms

# len(seq) = 12
List delay: 1887.53 ms
Map delay: 1320.49 ms

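map is consistently faster in these runs, but only by a factor of roughly 1.3 to 2.5. Both versions grow exponentially with the number of ambiguous positions, and the code can probably be optimised further.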
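A single time.time() measurement is noisy for calls this short. As a hedged sketch of a more repeatable comparison, the same two variants can be timed with timeit. The function names `with_list_comprehension` and `with_map` are made up for this example, and d is the same toy table as above, not the real IUPAC values.

```python
from itertools import product
import timeit

# Toy lookup table copied from the test above (not the real IUPAC values).
d = {"N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"]}
seq = "RNRNRN"  # 4**6 = 4096 expansions

def with_list_comprehension(seq):
    # Hypothetical helper: the list-comprehension variant.
    return ["".join(i) for i in product(*[d[j] for j in seq])]

def with_map(seq):
    # Hypothetical helper: the map-based variant.
    return list(map("".join, product(*map(d.get, seq))))

# Both variants must agree before their timings mean anything.
assert with_list_comprehension(seq) == with_map(seq)

# timeit.repeat runs each callable `number` times per trial; the minimum
# over the trials is the least noisy figure.
runs = 20
lst_best = min(timeit.repeat(lambda: with_list_comprehension(seq), number=runs, repeat=5))
map_best = min(timeit.repeat(lambda: with_map(seq), number=runs, repeat=5))

print("List comprehension: {:.3f} ms per call".format(lst_best / runs * 1000))
print("Map:                {:.3f} ms per call".format(map_best / runs * 1000))
```

The absolute numbers depend on the machine and Python version, so only the ratio between the two lines is worth comparing.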

Due to changes in Biopython, maybe related to [this](https://biopython.org/wiki/Alphabet), this code needs changing to work. Change the first line to `import Bio.Data.IUPACData as bdi` and then change `d = Seq.IUPAC.IUPACData.ambiguous_dna_values` to `d = bdi.ambiguous_dna_values`. — Wayne, Jan 17 '22 at 01:24
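Following the comment above, a version for current Biopython might look like the sketch below. The fallback dictionary is a hand-copied IUPAC table added only so the example runs without Biopython installed, and the ValueError for unknown letters is an addition, not part of the original answer.

```python
from itertools import product

try:
    # Location of the table suggested in the comment above.
    from Bio.Data.IUPACData import ambiguous_dna_values
except ImportError:
    # Assumption: hand-copied IUPAC DNA codes, used only when Biopython is missing.
    ambiguous_dna_values = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
        "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
        "X": "GATC", "N": "GATC",
    }

def extend_ambiguous_dna(seq):
    """Return a list of all sequences represented by an ambiguous DNA string."""
    try:
        # One string of alternatives per position, e.g. "AG" for R.
        choices = [ambiguous_dna_values[base] for base in seq.upper()]
    except KeyError as err:
        raise ValueError("not an IUPAC DNA code: {}".format(err)) from None
    return list(map("".join, product(*choices)))

for expanded in extend_ambiguous_dna("AATCRVTAA"):
    print(expanded)
```

Looking codes up with `[]` instead of `d.get` means an unexpected character fails with a clear message rather than a TypeError from inside product.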

score 2 · Answer 2 · answered Dec 18 '14 at 17:05

I eventually wrote my own function:

from Bio import Seq
from itertools import product

def extend_ambiguous_dna(seq):
    """return list of all possible sequences given an ambiguous DNA input"""
    d = Seq.IUPAC.IUPACData.ambiguous_dna_values
    r = []
    for i in product(*[d[j] for j in seq]):
        r.append("".join(i))
    return r

In [1]: extend_ambiguous_dna("AV")
Out[1]: ['AA', 'AC', 'AG']

It also lets you generate every pattern of a given length by passing a run of N:

In [2]: extend_ambiguous_dna("NN")

Out[2]: ['GG', 'GA', 'GT', 'GC',
         'AG', 'AA', 'AT', 'AC',
         'TG', 'TA', 'TT', 'TC',
         'CG', 'CA', 'CT', 'CC']

Hope this saves others some time!
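Since a run of N produces 4 to the power of its length sequences, it can help to count the expansions before building them, and to iterate lazily rather than hold the whole list in memory. The following is only a sketch: `CODES` is a hand-copied subset of the IUPAC table, and `count_expansions` and `iter_expansions` are names invented for this example.

```python
from itertools import islice, product
from math import prod  # available from Python 3.8

# Assumption: hand-copied subset of the IUPAC table, enough for the examples here.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "V": "ACG", "N": "GATC"}

def count_expansions(seq):
    """Number of concrete sequences, computed without generating them."""
    return prod(len(CODES[base]) for base in seq)

def iter_expansions(seq):
    """Yield the concrete sequences one at a time instead of building a list."""
    for combo in product(*(CODES[base] for base in seq)):
        yield "".join(combo)

print(count_expansions("NN"))      # 16
print(count_expansions("N" * 12))  # 16777216

# Take only the first few expansions of a large pattern.
print(list(islice(iter_expansions("N" * 12), 3)))
```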

score 0 · Answer 3 · answered Dec 18 '14 at 17:29

I'm not sure of a Biopython way to do this, but here's one with itertools:

import itertools

s = "AATCRVTAA"
ambig = {"R": ["A", "G"], "V": ["A", "C", "G"]}
groups = itertools.groupby(s, lambda char:char not in ambig)
splits = []
for b,group in groups:
    if b:
        splits.extend([[g] for g in group])
    else:
        for nuc in group:
            splits.append(ambig[nuc])
answer = [''.join(p) for p in itertools.product(*splits)]

Output:

In [189]: answer
Out[189]: ['AATCAATAA', 'AATCACTAA', 'AATCAGTAA', 'AATCGATAA', 'AATCGCTAA', 'AATCGGTAA']
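The groupby step above only separates ambiguous characters from unambiguous ones. As a possible simplification (a sketch, not the answerer's code), `dict.get(ch, ch)` can do the same job per character; the function name `expand` is invented for this example.

```python
from itertools import product

# Partial table, as in the answer above: only the ambiguous codes are listed.
ambig = {"R": "AG", "V": "ACG"}

def expand(s):
    # ambig.get(ch, ch) gives the alternatives for an ambiguous code and the
    # character itself otherwise, so every position contributes its own choices.
    return ["".join(p) for p in product(*(ambig.get(ch, ch) for ch in s))]

print(expand("AATCRVTAA"))
print(expand("RR"))  # adjacent ambiguous codes are handled position by position
```

Because each position is looked up independently, repeated or adjacent ambiguous codes need no special handling.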

score 0 · Answer 4 · answered Dec 18 '14 at 17:46

One more itertools solution:

from itertools import product
import re

lu = {'R':'AG', 'V':'ACG'}

def get_seqs(seq):
    seqs = []
    # split on the ambiguous codes, keeping them as separate items
    sp_seq = [a for a in re.split(r'(R|V)', seq) if a]
    pr_terms = [lu[a] for a in sp_seq if a in 'RV']

    for cmb in product(*pr_terms):
        seqs.append(''.join(sp_seq).replace('R', '%s').replace('V', '%s') % cmb)
    return seqs

seq = 'AATCRVTAA'

print('seq: ', seq)
print('\n'.join(get_seqs(seq)))

seq1 = 'RAATCRVTAAR'
print('seq: ', seq1)
print('\n'.join(get_seqs(seq1)))

Output:

seq:  AATCRVTAA
AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA
seq:  RAATCRVTAAR
AAATCAATAAA
AAATCAATAAG
AAATCACTAAA
AAATCACTAAG
AAATCAGTAAA
AAATCAGTAAG
AAATCGATAAA
AAATCGATAAG
AAATCGCTAAA
AAATCGCTAAG
AAATCGGTAAA
AAATCGGTAAG
GAATCAATAAA
GAATCAATAAG
GAATCACTAAA
GAATCACTAAG
GAATCAGTAAA
GAATCAGTAAG
GAATCGATAAA
GAATCGATAAG
GAATCGCTAAA
GAATCGCTAAG
GAATCGGTAAA
GAATCGGTAAG

wrong output in the special case where we have two or more same adjacent ambiguous codes like "RRATCGGTAAA" — Zingo, Feb 07 '17 at 08:20
