Ambiguity when constructing DFA from regular expression with kleene star

Question

How to avoid ambiguity when directly constructing DFA from regular expression like [ab]*ab?

In a straightforward implementation a cycle transition for [ab]* eats all a or b`s, and it of course differs from existing implementations for regular expressions, they take into account "ab" at the end as well. So the implementing automata must somehow know when to jump to last two transitions "a to b" from the first transition cycle.

Any ideas how to achieve this?

I am interested in answers for direct DFA constructing, not NFA with transformation to DFA. Adrian McCarthy's answer probably took the point:

"The simplistic answer is that the automaton simply has a state that literally means it doesn't know yet which path was taken."

- but without detailed explanation.

score 3 · Accepted Answer · edited May 23 '17 at 10:25

Using http://hackingoff.com/compilers/regular-expression-to-nfa-dfa with the regexp string:

    (a|b)*ab

The DFA generated from NFA has three states:

    S0: initial state
    S1: after an 'a'
    S2: after a 'b' that doesn't follow an a (follows a 'b' or is initial 'b')
    S3: after a 'b' in S1
    S3: accept

Existing implementations for regular expressions often allow backreferences (not so RE2), and use backtracking rather than NFA -> DFA, and are not therefore DFA (the backtracking logic adds state, similar to recursive call stack adding state). If you don't want to use backtracking, one way to parse expressions like this may be going backwards from accept state:

    ab
    S0 -a-> S1 -b-> S2 (accept)

    [ab]*ab
    S0 -a-> S1 -b-> S2 (accept)
    Sn -a-> S1
    S0,S2 -b-> S3

The NFA reduction allows for seeing that the case of 'a' is different (as it is partial match of existing final sequence), we should go to state S1 instead. The same link above gives larger than necessary DFA diagram for a longer string (their states 2 and 3 should be combined to a single state, any 'b' or 'c' stays in state, 'a' goes out to state 1):

    (a|b|c)*abc

To create DFA yourself, you can first generate the original DFA for abc:

    S0 -a-> S1 -b-> S2 -c-> S3 (accept)

and then fill in whatever doesn't match this track (top one highest priority, overrides whatever is next):

    S0 -a-> S1 -b-> S2 -c-> S3 (accept)
    Sn -a-> S1            (get on our track of final string / ambiguity state)
    Sn -b,c-> S4          (b or c; get off our track)

This fills in to:

    S0 -a-> S1 -b-> S2 -c-> S3 (accept)
    S1,S2,S3,S4 -a-> S1
    S0,S3,S4 -b,c-> S4
    S1 -c-> S4
    S2 -b-> S4

If we instead had:

    (a|c)*aac

You should think of the accept track similar to abc:

    S0 -a-> S1 -a-> S2 -c-> S3   (accept)

But also notice the sequence 'aa'. The ambiguity is relatively simple, still, as seeing any of these should keep you in state S2:

    aa   aaa    aaaa    aaaaa...a

The ambiguities are:

Is the 'a' the first 'a' of our accept track?
Is the second 'a' in 'aa' the first or second 'a' of our accept track?

We can hierarchically fill in states from the track, but the ambiguity overlaps to S2 this time as well:

    S0 -a-> S1 -a-> S2 -c-> S3 (accept)
    S2 -a-> S2            (additional here, the second ambiguity)
    Sn -a-> S1            (get on our track of final string / ambiguity state)
    Sn -c-> S4            (get off our track)

This fills in to:

    S0 -a-> S1 -a-> S2 -c-> S3 (accept)
    S2 -a-> S2
    S3,S4 -a-> S1
    S0,S1,S3,S4 -c-> S4

My general observation (as someone who taught compiler design):

For simple regexps, there is very often a way to understand the overlapped states of DFA that come from the NFA
For smaller cases, you may be able to create simple mechanisms
Any larger cases will still require NFA -> DFA processing

Beyond simple cases, to use backreferences to groupings and keeping that state as well, you are not talking about simple DFA anymore. It can still be converted to DFA, but if you don't backtrack (not a part of your standard DFA, backtracking logic adds additional state), your state space will be exponentially large, and there is no general simple approach to directly convert to DFA, you should really construct NFA first.

See How do backreferences in regexes make backtracking required?

For me works your example with `(a|b|c)*abc`, which in my opinion shed light on my problem and answered the question. As I now understand, the implementation of kleene star for char union which conflicts with followed linear path, must take into account the followed linear path and "dissolve" the union kleene star into the transitions of states of the path. Although I cannot see for now any simple alghoritm how to achieve this in automatic way (for my simple regex parser), your answer is definitely what I needed. Thank you also for other comments. — Pavel Novotny, Oct 06 '14 at 19:11
Thank you for helpful comments that helped me fix my answer. Glad to hear this has helped you. I think [abcde]*abcde type regexps would be relatively simple, but the general case with multiple ambiguities will be hard to directly convert to DFA. — Kanat Bolazar, Oct 07 '14 at 13:52

Ambiguity when constructing DFA from regular expression with kleene star

1 Answers1