
I'm trying to write Python code that, given a regular expression over a fixed alphabet, generates all alternative expressions with the same degrees of freedom. For example, if my alphabet is ACTG (DNA nucleotides) and my regular expression is [AG]CG (which matches ACG or GCG), I would want to output [AC]CG (matches ACG or CCG), [AT]CG (matches ACG or TCG), [AG]CC, and so on.

The problem is that I'm very new to Python and to programming in general, and I still haven't figured out a way to do this. The final goal is to find out whether a given string (a DNA transcript) is biased toward a particular degenerate sequence (the regular expression), by checking whether that sequence appears more often than the average number of appearances of all other similar degenerate sequences.
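To make that comparison concrete, the counting step might look roughly like the sketch below. The names `count_matches` and `bias_score` and the toy transcript are made up for illustration, and counting overlapping matches is an assumption about what "appearances" should mean here.

```python
import re
from statistics import mean


def count_matches(pattern, transcript):
    """Count matches of a degenerate pattern, including overlapping ones."""
    # A lookahead consumes no characters, so overlapping hits are all counted.
    return len(re.findall('(?=' + pattern + ')', transcript))


def bias_score(target, alternatives, transcript):
    """Return (target count, mean count over the alternative patterns)."""
    target_count = count_matches(target, transcript)
    alt_counts = [count_matches(p, transcript) for p in alternatives]
    return target_count, mean(alt_counts)


# Toy data, made up purely for illustration (not a real transcript).
transcript = 'ACGGCGACGTTCG'
target = '[AG]CG'
alternatives = ['[AC]CG', '[AT]CG', '[AG]CC']
print(bias_score(target, alternatives, transcript))
```

A target count well above the mean of the alternatives would hint at a bias, although deciding whether the difference is meaningful would need a proper statistical test.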

Thanks for any help or hint,

Eyal

Knowname

1 Answer


Thanks for the comments. For now (until I improve my Python skills) I managed to do this by hand for one specific regex, [AGT][AG]AC[ACT], using this code:

import itertools


def create_pots():
    # Class sizes at each position of [AGT][AG]AC[ACT]: 3, 2, 1, 1, 3.
    sizes = [3, 2, 1, 1, 3]
    pots = []
    for size in sizes:
        # All ways to choose `size` distinct letters from the alphabet.
        combos = [''.join(c) for c in itertools.combinations('AGCT', size)]
        # Single letters stay bare; larger sets are wrapped in brackets.
        pots.append(combos if size == 1 else ['[' + c + ']' for c in combos])
    # One choice per position, joined into a single pattern.
    gf = [''.join(p) for p in itertools.product(*pots)]
    # Drop the original pattern so only the alternatives remain.
    gf.remove('[AGT][AG]AC[ACT]')
    return gf

This returns a list of all possible regexes similar to mine (excluding the original), such as:

gf = ['[ACT][GT]CC[ACT]', '[GCT][CT]TT[GCT]', '[GCT][CT]TT[AGC]', '[GCT][CT]TT[AGT]', '[GCT][CT]TT[ACT]', '[GCT][CT]TA[GCT]', '[GCT][CT]TA[AGC]', '[GCT][CT]TA[AGT]', '[GCT][CT]TA[ACT]', '[GCT][CT]TG[GCT]', '[GCT][CT]TG[AGC]', '[GCT][CT]TG[AGT]'....]
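The same idea can probably be generalised so that the class sizes are read from the pattern instead of being typed in by hand. The following is only a sketch: `similar_patterns` and `wrap` are hypothetical names, and it assumes the pattern contains nothing but single letters and simple bracketed classes such as [AG] (no ranges, negation, or quantifiers), with every letter taken from the alphabet.

```python
import itertools
import re


def wrap(letters):
    """Write a set of letters as a bare letter or a bracketed class."""
    return letters if len(letters) == 1 else '[' + letters + ']'


def similar_patterns(pattern, alphabet='AGCT'):
    """List all patterns with the same class size at each position."""
    # Split into positions: a bracketed class like [AG], or one letter.
    tokens = re.findall(r'\[[^\]]+\]|.', pattern)
    original = []
    pots = []
    for token in tokens:
        # Order the letters as in `alphabet` so the original pattern can be
        # recognised among the generated ones (ValueError if a letter is
        # not in the alphabet).
        letters = sorted(set(token.strip('[]')), key=alphabet.index)
        original.append(wrap(''.join(letters)))
        pots.append([wrap(''.join(c))
                     for c in itertools.combinations(alphabet, len(letters))])
    original_pattern = ''.join(original)
    # One choice per position, skipping the original pattern itself.
    return [''.join(p) for p in itertools.product(*pots)
            if ''.join(p) != original_pattern]


alternatives = similar_patterns('[AGT][AG]AC[ACT]')
print(len(alternatives))
```

For the pattern above there are 4 * 6 * 4 * 4 * 4 = 1536 combinations, so the list holds 1535 alternatives once the original is excluded.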
Knowname