Algorithm to generate context free grammar from any regex

Question

Can anyone outline for me an algorithm that can convert any given regex into an equivalent set of CFG rules?

I know how to tackle the elementary stuff such as (a|b)*:

S -> a A
S -> a B
S -> b A
S -> b B
A -> a A
A -> a B
A -> epsilon
B -> b A
B -> b B
B -> epsilon
S -> epsilon (end of string)

However, I'm having some problem formalizing it into a proper algorithm especially with more complex expressions that can have many nested operations.

I hope you are kidding when you ask us for an entire outline for the algorithm. As you may have noticed that would be a lot of work. If you have a specific question about a specific problem feel free to ask, but don't ask us to practically design your code for you. — Jasper, Oct 30 '12 at 13:18
Shouldn't your cfg be `S -> a S; S -> b S; S -> epsilon`? I think the only string your provided cfg will match is "" because no other string it accepts is finite. — Wug, Oct 30 '12 at 13:21
This also really depends on which regex syntax elements you want to allow? Only regex in a theoretical sense? Or regex in the extended sense in which it is used in most engines? — Martin Ender, Oct 30 '12 at 13:22
@Wug State/Symbol S is a starting state in the example I gave. Oh and I did forget a couple of rules, edited. — gamerx, Oct 30 '12 at 13:26
Why are nested operation a problem? Can't you just break out any parenthesized expression into its own rule? — Emil Vikström, Oct 30 '12 at 13:33

Martin Ender · Accepted Answer · 2012-10-30T14:03:26.953

If you are just talking about regular expressions from a theoretical point of view, there are these three constructs:

ab       # concatenation
a|b      # alternation
a*       # repetition or Kleene closure

What you could then just do:

create a rule S -> (fullRegex)
for every repeated term (x)* in fullRegex create a rule X -> x X and X -> ε, then replace (x)* with X.
for every alternation (a|b|c) create rules Y -> a, Y -> b and Y -> c, then replace (a|b|c) with Y

Simply repeat this recursively (note that all x, a, b and c can still be complex regular expressions). Note that of course you have to use unique identifiers for every step.

This should be enough. This will certainly not give the most elegant or efficient grammar, but that is what normalization is for (and it should be done in a separate step and there are well-defined steps to do this).

One example: a(b|cd*(e|f)*)*

S -> a(b|cd*(e|f)*)*

S -> a X1; X1 -> (b|cd*(e|f)*) X1; X1 -> ε

S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> cd*(e|f)*

S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> c X2 (e|f)*; X2 -> d X2; X2 -> ε

... and a few more of those steps, until you end up with:

S  -> a X1
X1 -> Y1 X1
X1 -> ε
Y1 -> b
Y1 -> c X2 X3
X2 -> d X2
X2 -> ε
X3 -> Y2 X3
X3 -> ε
Y2 -> e
Y2 -> f

Algorithm to generate context free grammar from any regex

1 Answers1

Linked