Which one is better from a performance perspective? My guess is the second one since its DFA will have a smaller number of states.
Both expressions have the same number of states in their minimal DFAs, and those DFAs match the same "language" (in the theoretical sense).
Regardless of the number of states in the DFA, the performance is theoretically the same, since you traverse the states deterministically as you feed input to the automaton.
In practice, there might be a difference due to cache misses, which may happen more often when there are more states. However, unless you are working on the regex engine itself, I can't think of a good reason to spend time on this kind of black-box optimization.
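As a quick illustration (using hypothetical patterns `(?:a|b|c)+` and `[abc]+`, since the original two expressions aren't shown here), two syntactically different regexes that denote the same language give the same answer on every input, which is why any performance gap between them is an engine implementation detail:

```python
import re

# Two hypothetical but equivalent patterns: they denote the same
# regular language, so a DFA-based engine treats them identically.
p1 = re.compile(r"(?:a|b|c)+")
p2 = re.compile(r"[abc]+")

# Same language means the same accept/reject decision on every string.
for s in ["abc", "cab", "abd", ""]:
    m1 = p1.fullmatch(s) is not None
    m2 = p2.fullmatch(s) is not None
    assert m1 == m2
```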
Is there a general purpose programming language whose built-in regex compiler will reduce the given regex into the minimal DFA?
Go (RE2) and Haskell have engines that convert regexes into DFAs. I don't know whether the DFA is minimal, though.
Since POSIX ERE doesn't support backreferences (a backreference is different from a reference to a capturing group in a replacement), it is possible to write an engine for POSIX ERE that runs on a DFA or an efficient NFA simulation. However, since the standard doesn't mandate such an implementation, only a correct result (matching the leftmost longest string), an implementation could also exhaustively search for all strings matching the regex with an NFA backtracking engine.
However, at least GNU egrep seems to implement POSIX ERE with a DFA (judging by the file name dfa.c).
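For contrast, here is what a backreference looks like in a backtracking engine (Python's `re`, in this case). Backreferences go beyond regular languages, which is why DFA-based engines like RE2 and POSIX ERE leave them out:

```python
import re

# \1 must repeat exactly what group 1 captured. No DFA can do this,
# because the set of "doubled words" is not a regular language.
doubled = re.compile(r"^(\w+) \1$")

print(doubled.match("hello hello") is not None)  # True
print(doubled.match("hello world") is not None)  # False
```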
For your information, there are three approaches to implementing regular expression matching:
- DFA
- NFA-simulation algorithm
- NFA backtracking
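To make the second approach concrete, here is a minimal sketch of NFA simulation for a hard-coded NFA recognizing `(a|b)*abb` (the state numbering is my own illustration, not taken from any particular engine). The simulation tracks the *set* of reachable states, so each input character is processed once and the run time is linear in the input:

```python
# NFA for (a|b)*abb: state -> {symbol -> set of next states}.
# State 3 is accepting; this particular NFA needs no epsilon moves.
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START, ACCEPT = {0}, {3}

def nfa_match(text):
    """Simulate the NFA on `text` by tracking all reachable states."""
    current = set(START)
    for ch in text:
        current = {nxt for state in current
                       for nxt in NFA[state].get(ch, ())}
    return bool(current & ACCEPT)

print(nfa_match("abb"))     # True
print(nfa_match("aababb"))  # True
print(nfa_match("abab"))    # False
```

Because `current` can never hold more states than the NFA has, the worst case is O(states × input length), never exponential.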
For more details, this article (cited in this question) explains:
- For (theoretical) regular expressions, there exists an efficient NFA-simulation algorithm with sub-match tracking (capturing groups).
- Why backtracking engines are so prominent (e.g. Java, Python, Perl², ...)
  ²: Perl implements a backtracking engine with memoization.
- Backtracking engines may take time exponential in the length of the input, while Thompson's NFA-simulation algorithm takes O(mn) time, where m is the length of the regular expression and n is the length of the input.
- The most efficient known algorithm for matching an (ir)regular expression with backreferences is the backtracking approach. Therefore, some engines choose not to support backreferences in order to keep matching efficient.
- Backtracking engines (even with memoization) are slower than Thompson's NFA simulation algorithm.
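The exponential worst case in the list above is easy to trigger in Python's `re`, which is a backtracking engine. The pattern below is a standard pathological example (the exact timings will vary by machine, so I've kept the input small):

```python
import re
import time

# (a+)+$ against a's followed by a non-matching character forces a
# backtracking engine to try exponentially many ways of splitting
# the a's between the two "+" loops before concluding failure.
pattern = re.compile(r"(a+)+$")

for n in (10, 14, 18):
    text = "a" * n + "!"
    start = time.perf_counter()
    result = pattern.match(text)  # always None: "!" can never match
    elapsed = time.perf_counter() - start
    print(f"n={n}: match={result}, {elapsed:.4f}s")
```

An NFA simulation or a DFA answers the same queries in time linear in the input, since the failing character is seen exactly once.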
By the way, the RE2 engine (mentioned above) includes implementations of both DFA-based and NFA-based (efficient simulation) matching algorithms.