9

For example, given the pattern

[a-zA-Z]_[0-9]{2}

a function would take the pattern and return an array or list containing

a_00, a_01, a_02, ... , a_98, a_99, ... , z_98, z_99

Only numbers and letters (and finite groupings thereof) need be expanded. How would I go about doing this? An example in Python or Perl would be preferred. Thank you!

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
Ignis Umbrae
  • 207
  • 3
  • 7
  • 1
    That are 5200 matches. Sure you want to have it? And sorry, I can't do python or perl. – Dykam Aug 08 '09 at 10:41
  • I'd be cautious too. The number of possible combinations grows exponentially with respect to the regex lenght, so anything but short strings can be **very** time and memory consuming. – Agos Aug 08 '09 at 10:47
  • 11
    @etotheipi: instead of explaining your percieved solution to the problem, instead tell us what you are actually trying to achieve... – Mitch Wheat Aug 08 '09 at 10:48
  • @Mitch Wheat: This is in the top 10 of the "forever comments". I wish I could upvote it more than once. – Tomalak Aug 08 '09 at 10:50

3 Answers3

12

First, parse the expression and build a tree reflecting the syntactic structure of the regular expression, and include a terminator node that logically appears at the end. For example, in lisp notation, your example might look like this:

(concat
  (repeat 2
    (concat
      (charset
        (range a z)
        (range A Z))
      (literal _)
      (charset
        (range 0 9))))
  terminator)

Next, thread the tree so that you can use recursion to generate the combinatorial expansion. By that, I mean e.g. that nodes a..z in (concat a .. z) all need to have pointers from one to the next, so a points to b, and so on, and the concat node itself points to its successor. The idea is that you can produce one version of the current element in the expansion, and recurse into the next element, and when the next element returns you can try the next version of the current element, etc., until all versions are exhausted and you return to your caller (the predecessor or parent). This threading is easiest done with a stack and a post-order traversal of the tree. If you carefully push nodes during post-order traversal, the top of the stack will be the next element in sequence.

(An alternative to threading is to structure the tree so that the next element in every concat node is a child of the previous node, and repeat nodes' children point back to the repeat node.)

Then write a routine (or set of routines using pattern matching, or virtual methods if nodes in the tree are implemented using polymorphism in an object-oriented language) that, for any given node type, produces the correct output and recurses into the next node or children in the appropriate way. For example, in pseudocode:

if node is of form (repeat n): # non-variable repeat
    for i in 1 to n
        recurse into child
    recurse into threaded successor

if node is of form (concat ...):
    recurse into first element # last element will recurse into successor

if node is of form (literal x):
    emit x
    recurse into successor
    remove x

if node is of form (charset ...):
    for each element in charset:
        emit element
        recurse into successor
        remove element

if node is terminator:
    add string created thus far to final output list

etc. As you can observe, children of a repeat node mustn't recurse into the successor of the repeat node, so that needs to be taken into account when threading the tree. Similar care needs to be taken that "current progress" isn't lost when reaching the end of child nodes of a repeat node; alternatively, the successor of child nodes could point to the repeat node itself (i.e. a true closure over the graph of nodes), but that will require storing a counter somewhere.

All in all, completing this could take several days, depending on how flexible and performant it needs to be. Also note that certain constructs, such as Kleene star or closure (* or + in extended regex) will result in an infinite list. A language that supports generators (e.g. C#'s iterator syntax) or coroutines / continuations (e.g. Scheme's call/cc) or lazy evaluation (e.g. Haskell) can permit iteration over the first few elements of the list without having to evaluate the entire list. Alternatively, choosing random potential output rather than exhaustive iteration may be preferable for providing examples corresponding to a regular expression.

Barry Kelly
  • 41,404
  • 5
  • 117
  • 189
0

Use thrax!! Thrax provides command line tools to interact with the heavy-duty OpenFST toolkit. You can probably find it on package managers like apt or macports.

Here's an example of how to use thrax to sample a regex. Today I wanted to figure out a name for a library I am writing. I wanted the name to be an acronym for complete embedded? mrf? (optimization|inference) toolkit. After trying to come up with a name manually I gave up and wrote the following regex for possible acronyms: co?m?e?m?[io]to?.

So now the problem reduces to sampling some strings that satisfy this acronym. This is where thrax comes in. We can first write the regex in a grm file like this.

[Save following in file named tmp.grm]
sigma=("c"|"o"|"m"|"e"|"i"|"o"|"t");
export a= Optimize[ ("c":"c") (("o":"")|("o":"o")) (("m":"")|("m":"m")) (("e":"")|("e":"e")) (("m":"")|("m":"m")) (("o":"o")|("i":"i")) ("t":"t") (("o":"")|("o":"o")) ];

You can probably guess what's going on. For every x? I have specified that either x will be output or '' will be output. Now we can compile this grm file and sample strings from the language.

thraxcompiler --save_symbols --input_grammar=tmp.grm --output_far=tmp.far
thraxrandom-generator --far=tmp.far --rule=a --noutput=100

The output contains possible strings that the regex can match. I think there are also utilities for generating all possible outputs using thrax but if you just sample a large number of strings and uniq them that should be a good 90% solution.

****************************************
comemoto
cmemot
****************************************
comemoto
comoto
****************************************
comemoto
comot
****************************************
comemito
coemito
****************************************
comemoto
coeot
****************************************
comemoto
cmeot
****************************************
comemito
coit
****************************************
comemoto
cemot
****************************************
comemoto
ceoto
Pushpendre
  • 795
  • 7
  • 19
-1

Given that you can figure out the length of strings of matching lengths, you could brute force it by generating all possible strings with the correct lengths and then just try to match.

dala
  • 1,975
  • 3
  • 14
  • 15