I am trying to randomly generate a string of length n from 5 characters ('ATGC '). I am currently using itertools.product, but it is incredibly slow. I switched to itertools.combinations_with_replacement, but it skips some values. Is there a faster way of doing this? For my application, order matters.

for error in itertools.product('ATGC ', repeat=len(errorPos)):
    print(error)
    for ps in error:
        for pos in errorPos:
            if ps == " ":
                fseqL[pos] = ""
            else:
                fseqL[pos] = ps
  • What are the inner loops for? – John Kugelman Jan 17 '22 at 00:23
  • I am using this algorithm to error-correct a DNA strand. Those inner loops are for inserting the result from the iterable into the strand. – AWESDUDE COOL Jan 17 '22 at 00:24
  • If I reversed the results from `itertools.combinations` and used those, would that give me all the possibilities? – AWESDUDE COOL Jan 17 '22 at 00:36
  • Can't you use [this](https://stackoverflow.com/a/27552377/8508004) and just give it a sequence of Ns of the length you need? (Came from [Biostars post 'all posible sequences from consensus'](https://www.biostars.org/p/282490/).) – Wayne Jan 17 '22 at 00:38

1 Answer


If you just want a random single sequence:

import random

def generate_DNA(N):
    """Return a random DNA sequence of length N."""
    possible_bases = 'ACGT'
    return ''.join(random.choice(possible_bases) for _ in range(N))

one_hundred_bp_sequence = generate_DNA(100)

That was posted before the question clarified that spaces are needed; you can change possible_bases to 'ACGT ' if you need spaces allowed.
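A minimal sketch of that change, assuming the alphabet is passed in as a parameter. The function name `generate_seq` and the `alphabet` argument are illustrative and not part of the original answer:

```python
import random

def generate_seq(n, alphabet="ACGT "):
    """Return one random string of length n drawn from alphabet (hypothetical helper)."""
    # random.choices samples with replacement, so characters can repeat
    return "".join(random.choices(alphabet, k=n))

seq = generate_seq(100)
```

Passing `alphabet="ACGT"` gives the same behavior as the function above, without spaces.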


If you want all combinations, allowing a space too, here is a solution adapted from this answer, which I learned of from the Biostars post 'all possible sequences from consensus':

from itertools import product

def all_possibilities_w_space(seq):
    """Return a list of all possible sequences for a completely ambiguous DNA input, allowing spaces."""
    d = {"N": "ACGT "}
    # each N expands to its five options; product yields every combination
    return list(map("".join, product(*map(d.get, seq))))

all_possibilities_w_space("N" * 2)  # example of length two

The idea is that N can be any of "ACGT ", and the multiplier sets the length. According to the answer I adapted it from, using map keeps the work in C, which makes it faster.
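Since the number of results grows as 5**n, building the whole list may not be practical for longer inputs. Here is a hedged sketch that iterates lazily instead and writes each candidate into a copy of the strand. `fseqL` and `errorPos` come from the question; the function name `candidate_strands` and the sample values are assumptions for illustration only:

```python
from itertools import product

def candidate_strands(fseqL, errorPos, alphabet="ACGT "):
    """Yield one candidate strand per assignment of alphabet characters to errorPos."""
    for combo in product(alphabet, repeat=len(errorPos)):
        strand = list(fseqL)  # copy so the original strand is left untouched
        # pair each error position with its own character from the combination
        for pos, ch in zip(errorPos, combo):
            strand[pos] = "" if ch == " " else ch  # a space means "drop this base"
        yield "".join(strand)

# hypothetical example: two uncertain positions give 5**2 = 25 candidates
results = list(candidate_strands(list("AAGTC"), [1, 3]))
```

Because this is a generator, you can stop as soon as a candidate passes whatever check you apply, rather than materializing every combination first.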

Wayne