Using http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
with the regexp string:
(a|b)*ab
The DFA generated from NFA has three states:
S0: initial state
S1: after an 'a'
S2: after a 'b' that doesn't follow an a (follows a 'b' or is initial 'b')
S3: after a 'b' in S1
S3: accept
Existing implementations for regular expressions often allow backreferences (not so RE2), and use backtracking rather than NFA -> DFA, and are not therefore DFA (the backtracking logic adds state, similar to recursive call stack adding state). If you don't want to use backtracking, one way to parse expressions like this may be going backwards from accept state:
ab
S0 -a-> S1 -b-> S2 (accept)
[ab]*ab
S0 -a-> S1 -b-> S2 (accept)
Sn -a-> S1
S0,S2 -b-> S3
The NFA reduction allows for seeing that the case of 'a' is different (as it is partial match of existing final sequence), we should go to state S1 instead. The same link above gives larger than necessary DFA diagram for a longer string (their states 2 and 3 should be combined to a single state, any 'b' or 'c' stays in state, 'a' goes out to state 1):
(a|b|c)*abc
To create DFA yourself, you can first generate the original DFA for abc:
S0 -a-> S1 -b-> S2 -c-> S3 (accept)
and then fill in whatever doesn't match this track (top one highest priority, overrides whatever is next):
S0 -a-> S1 -b-> S2 -c-> S3 (accept)
Sn -a-> S1 (get on our track of final string / ambiguity state)
Sn -b,c-> S4 (b or c; get off our track)
This fills in to:
S0 -a-> S1 -b-> S2 -c-> S3 (accept)
S1,S2,S3,S4 -a-> S1
S0,S3,S4 -b,c-> S4
S1 -c-> S4
S2 -b-> S4
If we instead had:
(a|c)*aac
You should think of the accept track similar to abc:
S0 -a-> S1 -a-> S2 -c-> S3 (accept)
But also notice the sequence 'aa'. The ambiguity is relatively simple, still, as seeing any of these should keep you in state S2:
aa aaa aaaa aaaaa...a
The ambiguities are:
- Is the 'a' the first 'a' of our accept track?
- Is the second 'a' in 'aa' the first or second 'a' of our accept track?
We can hierarchically fill in states from the track, but the ambiguity overlaps to S2 this time as well:
S0 -a-> S1 -a-> S2 -c-> S3 (accept)
S2 -a-> S2 (additional here, the second ambiguity)
Sn -a-> S1 (get on our track of final string / ambiguity state)
Sn -c-> S4 (get off our track)
This fills in to:
S0 -a-> S1 -a-> S2 -c-> S3 (accept)
S2 -a-> S2
S3,S4 -a-> S1
S0,S1,S3,S4 -c-> S4
My general observation (as someone who taught compiler design):
- For simple regexps, there is very often a way to understand the overlapped states of
DFA that come from the NFA
- For smaller cases, you may be able to create simple mechanisms
- Any larger cases will still require NFA -> DFA processing
Beyond simple cases, to use backreferences to groupings and keeping that state as well, you are not talking about simple DFA anymore. It can still be converted to DFA, but if you don't backtrack (not a part of your standard DFA, backtracking logic adds additional state), your state space will be exponentially large, and there is no general simple approach to directly convert to DFA, you should really construct NFA first.
See How do backreferences in regexes make backtracking required?