If you are just talking about regular expressions from a theoretical point of view, there are these three constructs:
ab # concatenation
a|b # alternation
a* # repetition or Kleene closure
What you could then do is:
- create a rule S -> (fullRegex)
- for every repeated term (x)* in fullRegex, create the rules X -> x X and X -> ε, then replace (x)* with X.
- for every alternation (a|b|c), create the rules Y -> a, Y -> b and Y -> c, then replace (a|b|c) with Y.
Simply repeat this recursively (note that x, a, b and c can themselves still be complex regular expressions). Of course, you have to use a unique identifier for every new nonterminal you introduce.
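The steps above can be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions of mine: literals are single lowercase characters (so they cannot clash with the nonterminal names S, X1, Y1, ...), only concatenation, | and * are supported, and the function name regex_to_grammar is made up for illustration. It processes the regex with a small recursive-descent parser, so the nonterminals are numbered from the innermost subexpression outwards rather than in the order you would pick by hand.

```python
from itertools import count


def regex_to_grammar(regex):
    """Return a list of (head, body) rules; an empty body stands for ε."""
    rules = []
    star_ids, alt_ids = count(1), count(1)  # unique identifiers per construct
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def parse_alternation():
        # alternation := concatenation ('|' concatenation)*
        nonlocal pos
        branches = [parse_concatenation()]
        while peek() == "|":
            pos += 1
            branches.append(parse_concatenation())
        if len(branches) == 1:
            return branches[0]
        # (a|b|c) becomes Y -> a, Y -> b, Y -> c and is replaced by Y
        name = f"Y{next(alt_ids)}"
        rules.extend((name, branch) for branch in branches)
        return [name]

    def parse_concatenation():
        symbols = []
        while peek() is not None and peek() not in "|)":
            symbols.extend(parse_repetition())
        return symbols

    def parse_repetition():
        nonlocal pos
        if peek() == "*":
            raise ValueError(f"nothing to repeat at position {pos}")
        if peek() == "(":
            pos += 1
            body = parse_alternation()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
        else:
            body = [regex[pos]]
            pos += 1
        while peek() == "*":
            pos += 1
            # (x)* becomes X -> x X, X -> ε and is replaced by X
            name = f"X{next(star_ids)}"
            rules.append((name, body + [name]))
            rules.append((name, []))
            body = [name]
        return body

    start = parse_alternation()
    if pos != len(regex):
        raise ValueError(f"unexpected character at position {pos}")
    rules.insert(0, ("S", start))
    return rules


for head, body in regex_to_grammar("a(b|cd*(e|f)*)*"):
    print(head, "->", " ".join(body) or "ε")
```

The parser does the "repeat this recursively" part implicitly: by the time a starred group or an alternation is turned into a rule, everything nested inside it has already been replaced by nonterminals.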
This should be enough. It will certainly not give the most elegant or efficient grammar, but that is what normalization is for (it should be done as a separate step, and there are well-defined procedures for it).
One example: a(b|cd*(e|f)*)*
S -> a(b|cd*(e|f)*)*
S -> a X1; X1 -> (b|cd*(e|f)*) X1; X1 -> ε
S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> cd*(e|f)*
S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> c X2 (e|f)*; X2 -> d X2; X2 -> ε
... and a few more of those steps, until you end up with:
S -> a X1
X1 -> Y1 X1
X1 -> ε
Y1 -> b
Y1 -> c X2 X3
X2 -> d X2
X2 -> ε
X3 -> Y2 X3
X3 -> ε
Y2 -> e
Y2 -> f
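If you want to convince yourself that the resulting grammar really describes the same language, one option is a brute-force comparison on short strings. The following is a rough sketch, not a proof: it enumerates everything the grammar above derives up to a length bound and compares it with what Python's re module accepts. The helper names generate and matching and the bound of 5 are my own choices for illustration.

```python
import re
from itertools import product

# The final grammar from above; an empty list stands for ε.
GRAMMAR = {
    "S":  [["a", "X1"]],
    "X1": [["Y1", "X1"], []],
    "Y1": [["b"], ["c", "X2", "X3"]],
    "X2": [["d", "X2"], []],
    "X3": [["Y2", "X3"], []],
    "Y2": [["e"], ["f"]],
}


def generate(grammar, start, max_len):
    """All terminal strings of length <= max_len derivable from start."""
    results, seen = set(), set()
    stack = [(start,)]
    while stack:
        form = stack.pop()
        if form in seen:
            continue
        seen.add(form)
        # Terminals never disappear again, so forms with too many are dead ends.
        if sum(sym not in grammar for sym in form) > max_len:
            continue
        # Always expand the leftmost nonterminal (leftmost derivation).
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            results.add("".join(form))
            continue
        for body in grammar[form[idx]]:
            stack.append(form[:idx] + tuple(body) + form[idx + 1:])
    return results


def matching(pattern, alphabet, max_len):
    """All strings over alphabet of length <= max_len that fully match pattern."""
    regex = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if regex.fullmatch("".join(chars))
    }


MAX_LEN = 5
assert generate(GRAMMAR, "S", MAX_LEN) == matching("a(b|cd*(e|f)*)*", "abcdef", MAX_LEN)
```

The enumeration terminates for this grammar because every rule that adds nonterminals is followed shortly by one that adds a terminal. For arbitrary grammars (for example with rules like X -> X X) this naive search would need an extra bound on the length of the sentential form.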