Regular expression: language generator

Question

Given a regular expression in C#, is there a way to generate a word that is accepted by this regular expression?

For instance, let's consider:

[ab]c*b*

Is there a function that can automatically generate a enumeration like:

a
b
ac
ab
bc
bb
acb
bcb
acc
bcc
...

Obviously this list being infinite of potentially of as-long-as-you-want words, the generator would have to be smart in order to output things from the simplest to the most complex, without being trapped in infinite loops.

I think this would be a useful tool in order to validate regular expression. In general it's easy to see that a regular expression accepts words that you planned it would accept. It's usually much more difficult to see what other words it would accept.

EDIT: This question is not about how to do it, but rather: is there anything out there that I could use to do it in C#?

Looking to solve the [halting problem](http://en.wikipedia.org/wiki/Halting_problem)? — Oded, Mar 02 '12 at 15:09
Regexes are not turing complete. edit: Regular expressions in general are not turing complete. If C# lets you write turing-complete ones, then yes, that's a problem and those features will have to be forbidden. — zmccord, Mar 02 '12 at 15:16
Oh, I see this is also a partial dupe with http://stackoverflow.com/questions/4208733/generative-regular-expressions — zmccord, Mar 02 '12 at 15:19
@Oded: no, this is about professional environment, I am building an application that can be driven by a small language that is parsed with regex. I want to check that I am not accepting too many words, and I would like to use such a generator to create tests. — edeboursetty, Mar 02 '12 at 15:49
@zmccord: In fact, my question is about knowing if someone already did it for C# regex. — edeboursetty, Mar 02 '12 at 15:52
Ah, my mistake… I misread. But any tool that accepts your regexes would serve your purpose, right? Doesn't strictly have to be in C#? — zmccord, Mar 02 '12 at 16:00
@zmccord: well for me I don't have anything else than C# to work with, so that would be painful to use something else. Now, if you know about tools in other languages, that can be interesting for other people too. — edeboursetty, Mar 02 '12 at 16:05
There is a Perl script I remember using while working on SpamAssassin years ago. Wonder if it still exists... The old uri is no longer answering :(... Look for Regex Expander... — jessehouwing, Mar 02 '12 at 18:09
@Oded: Can you explain why you think this has anything to do with the halting problem? It is an elementary exercise to take a discrete finite automaton and find a path, if one exists, from the start state to the goal state. (This assumes of course that the regular expression is actually *regular*. If it is one of those so-called "regular" expressions that actually requires a pushdown automaton then that gets a little harder.) — Eric Lippert, Mar 02 '12 at 18:32
@EricLippert - The comment is more of a reflection of my poor understanding of regular languages and finite automata. I've been reading "The new Turing omnibus" and am still rather hazy on these... — Oded, Mar 02 '12 at 19:53
@Oded: Consider the following algorithm: Draw the graph corresponding to a DFA for a given regular expression. Is the start state a goal state? Then the empty string is a member of the language. Now enumerate all paths that are *one unit* long from the start state to any goal state. Those yield the one-character strings that are in the language. Now enumerate all the paths that are *exactly two units* from the start state to a goal state. Those are the two-character strings in the language... keep going and you'll get all the strings in the language, sorted by length. — Eric Lippert, Mar 02 '12 at 21:47
@EricLippert - note that while this works for true regular expressions, .NET "regular expressions" contain many irregular features (such as backreferences). I believe that it may be undecidable whether an arbitrary .NET regex matches any strings. — kvb, Mar 02 '12 at 22:36

score 1 · Answer 1 · answered Mar 02 '12 at 15:15

This isn't even a C#-specific question; I think you can do this with any true regex.

It seems to me like you should be able to tell a generation story for any regex match that's just a list of rewrites. In your example [ab]c*b* can generate acccbbb; that's [ab]c*b*->ac*b*->acccb*->acccbbb. For each operator we can imagine enumerating all the ways it rewrites; then it's just a question of enumerating all combinations of rewrites, which boils down to enumerating all the N-tuples of naturals.

edit: N-tuples of naturals is a glib comparison. But you could imagine essentially performing a breadth-first traversal over rewrite states, outputting each string that all operators have been rewritten out of.

you can transform your regex as a finite state automaton and then explore the graph with some kind of heuristic. But really, I don't have the time to do it myself ;) — edeboursetty, Mar 02 '12 at 15:59

score 0 · Answer 2 · answered Jan 27 '15 at 10:17

I don't know how to do this in C#, but in theory yes, it can be done.

You need to convert your regular expression to a NFA or DFA graph, transverse it with a BFS keeping track of the current path, adding a new character to the path for each edge, and printing the current path when the finish nodes are hit. Depending on the regular expression at hand your memory usage can easily grow exponentially.

For example given the regular expression (a|b)*abb we can create a NFA graph as the following:

NFA for <code>(a|b)*abb</code>

This NFA graph can be used both to recognize a word and to enumerate all possible words. We do that by nondeterministically traversing the graph. Meaning, we need to keep track of all possible paths in the graph.

Starting at zero we do a BFS, and for each node that has two or more output edged we create a new nondeterministic path. The BFS visits the nodes in the following order, each time printing:

0, 1, 7, 2, 4, 8, 3, 5, 9, 6, 6, 10, 1, 1, 7, ...

For each node visited we have the intermediate temporary paths as:

0, ""
1, "e"
7, "e"
2, "ee"
4, "ee"
8, "ea"
3, "eea"
5, "eeb"
9, "eab"
6, "eeae"
6, "eebe"
10, "eabb"
1, "eeaee"
1, "eebee"

The "e" symbol is the epsilon letter representing the empty string "", which should be filtered out while printing each word.

By doing a BFS over the graph we're sorting each word by the number of edges needed to recognize the word with the NFA back again. Since the graph contains a cycle this procedure will never finish.

Each time each nondeterministic path reaches the ending node 10 we print the generated string:

"abb"
"aabb"
"babb"

Regular expression: language generator

2 Answers2