I have a DNA sequence which is variable only at specific locations and need to find all possible scenarios:

DNA_seq='ANGK' #N can be T or C  and K can be A or G
N=['T','C']
K=['A','G']

Results:

['ATGA','ATGG','ACGA','ACGG']

The solution offered by @vladimir works perfectly for simple cases like the example above, but for more complicated scenarios such as the one below it quickly runs out of memory. For the example below, even a run with 120 GB of memory ended with an out-of-memory error. This is surprising because I estimate the total number of combinations at around 500K strings of 33 bp each, which I assume should not consume more than 100 GB of RAM. Are my assumptions wrong? Any suggestions?

from itertools import product

N=['A','T','C','G']
K=['G','T']
# Builds the complete list of strings in memory at once
dev_seq=[f'{N1}{N2}{K1}{N3}{N4}{K2}{N5}{N6}{K3}TCC{N7}{N8}{K4}CTG{N9}{N10}{K5}CTG{N11}{N12}{K6}{N13}{N14}{K7}{N15}{N16}{K8}' for \
           N1,N2,K1,N3,N4,K2,N5,N6,K3,N7,N8,K4,N9,N10,K5,N11,N12,K6,N13,N14,K7,N15,N16,K8 in \
               product(N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K)]
Masih

1 Answer

Use itertools.product:

from itertools import product
result = [f'A{n}G{k}' for n, k in product(N, K)]

Result:

['ATGA', 'ATGG', 'ACGA', 'ACGG']

EDIT

If you don't want to store the whole list in memory at one time, and would rather process the strings sequentially as they come, you can use a generator:

g = (f'A{n}G{k}' for n, k in product(N, K))
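
As a rough sketch of how such a generator might be consumed for the larger pattern from the question: the helper `expand`, the `codes` mapping and the `template` string below are hypothetical names, not part of the original code, and the template is simply the question's f-string rewritten with one letter per position. Sequences are produced one at a time, so only the ones you actually request ever exist in memory.

```python
from itertools import islice, product

def expand(template, codes):
    """Lazily yield every sequence encoded by `template` (hypothetical helper).

    `codes` maps an ambiguity letter to its allowed bases; any other
    character in the template is kept as a fixed base.
    """
    # One pool of choices per position; a fixed base is a pool of size 1.
    pools = [codes.get(base, base) for base in template]
    for combo in product(*pools):
        yield ''.join(combo)

# Small example from the question.
small = list(expand('ANGK', {'N': ['T', 'C'], 'K': ['A', 'G']}))
print(small)  # ['ATGA', 'ATGG', 'ACGA', 'ACGG']

# Large example: take only the first few sequences instead of building them all.
codes = {'N': ['A', 'T', 'C', 'G'], 'K': ['G', 'T']}
template = 'NNKNNKNNKTCCNNKCTGNNKCTGNNKNNKNNK'
first_five = list(islice(expand(template, codes), 5))
```

In real use you would probably loop over `expand(template, codes)` directly and write or filter each sequence as it arrives, rather than collecting the results in a list.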
Vladimir Fokow
  • Thanks for the solution, but it won't scale to the complicated scenarios. I provided an example in the question. Do you have any suggestions? – Masih Sep 02 '22 at 03:32
  • @Masih I added how to create a generator. Note that the number of strings in your new input is around 1 trillion, so in order not to run out of memory you can't store them all at the same time; you'll have to process them as they come. – Vladimir Fokow Sep 02 '22 at 03:56
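
The "around 1 trillion" figure in the last comment can be checked with a few lines of arithmetic. This is only a sketch: `template` and `codes` are illustrative names, with the template being the question's pattern written one letter per position (16 `N` positions, 8 `K` positions and 9 fixed bases). The byte estimate counts one byte per character and ignores Python's per-object overhead, so real usage would be higher.

```python
from math import prod

N = ['A', 'T', 'C', 'G']
K = ['G', 'T']
codes = {'N': N, 'K': K}
template = 'NNKNNKNNKTCCNNKCTGNNKCTGNNKNNKNNK'  # 33 positions

# Multiply the number of choices at each position (fixed bases count as 1).
total = prod(len(codes.get(base, base)) for base in template)
print(total)  # 1099511627776, i.e. 4**16 * 2**8 == 2**40

# Lower bound on storage: one byte per character, no object overhead.
raw_bytes = total * len(template)
print(raw_bytes / 2**40)  # 33.0 (TiB)
```

So the count is about 1.1 trillion rather than 500K, and even the bare characters would need roughly 33 TiB, which is why 120 GB was not enough.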