1

I have a regular expression: ATG(C|G|A)(C|T)GA

The pattern can take any form, but the only special construct it uses is alternation (|) inside parentheses. Groups may appear at any position in the string, and each group may contain any number of letters.

I want to generate a list of every string that this pattern matches:

ATGCCGA
ATGCTGA
ATGGCGA
ATGGTGA
ATGACGA
ATGATGA

I have been unable to find a Python library that can do this.

Wiktor Stribiżew

2 Answers

4

You could take the Cartesian product of the variable parts of the string using itertools.product, then join each combination with the static parts of the string.

>>> from itertools import product
>>> [f'ATG{i}{j}GA' for i,j in product('CGA', 'CT')]
['ATGCCGA', 'ATGCTGA', 'ATGGCGA', 'ATGGTGA', 'ATGACGA', 'ATGATGA']
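
Since the question says the pattern can vary, the same idea can be driven by the pattern string itself rather than hard-coded groups. The following is a minimal sketch, not part of the original answer: the helper name expand is hypothetical, and it assumes the pattern contains only literal characters and flat, non-nested (a|b|c) groups.

```python
import re
from itertools import product

def expand(pattern):
    # Split on parenthesised groups; the capturing group in the split
    # regex keeps each "(...)" chunk in the resulting list.
    parts = re.split(r'(\([^()]*\))', pattern)
    # A group contributes its alternatives; a literal contributes itself.
    choices = [p[1:-1].split('|') if p.startswith('(') else [p] for p in parts]
    # One output string per element of the Cartesian product.
    return [''.join(combo) for combo in product(*choices)]

print(expand('ATG(C|G|A)(C|T)GA'))
```

Adjacent groups produce an empty literal between them, which contributes a single empty choice and so does not affect the result.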
Cory Kramer
1

You can use recursion:

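import collections

s = 'ATG(C|G|A)(C|T)GA'

def combos(d):
    # r: strings built so far; k: literal characters read since the last special character
    r, k = [], None
    while d:
        if (c := d.popleft()) not in '|()':  # ":=" requires Python 3.8+
            k = (k if k else '') + c
        elif c == '|':
            # End of one alternative inside a group
            if k:
                r.append(k)
            k = None
        elif c == '(':
            # Recurse to read the group, then append each alternative to every prefix
            r = [v + (k or '') + i for i in combos(d) for v in (r if r else [''])]
            k = None
        else:
            # ')' closes the current group: store the last alternative and stop
            if k:
                r.append(k)
            k = None
            break
    # Append any trailing literal to every string built so far
    yield from ([i + (k or '') for i in r] if r else [k])

print(list(combos(collections.deque(s))))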

Output:

['ATGCCGA', 'ATGGCGA', 'ATGACGA', 'ATGCTGA', 'ATGGTGA', 'ATGATGA']
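
A quick way to sanity-check a generated list, whichever approach produced it, is to confirm that every string fully matches the original pattern and that the count equals the product of the group sizes. This is only an illustrative check using the values from the question, with the variable names chosen here for the example.

```python
import re

pattern = 'ATG(C|G|A)(C|T)GA'
generated = ['ATGCCGA', 'ATGGCGA', 'ATGACGA', 'ATGCTGA', 'ATGGTGA', 'ATGATGA']

# Every generated string should match the whole pattern, not just a prefix.
all_match = all(re.fullmatch(pattern, g) for g in generated)

# No duplicates, and the count equals the product of the group sizes (3 * 2).
count_ok = len(set(generated)) == len(generated) == 3 * 2

print(all_match, count_ok)
```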
Ajax1234