The BRE and ERE engines will match with the Leftmost Longest Rule, which is different from how Perl and other NFA-based regex engines matches the regex.
The documentation from Boost library is more detailed in regards to the technical aspect, so I quote it here:
The Leftmost Longest Rule
Often there is more than one way of matching a regular expression at a particular location, for POSIX basic and extended regular expressions, the "best" match is determined as follows:
- Find the leftmost match, if there is only one match possible at this location then return it.
- Find the longest of the possible matches, along with any ties. If there is only one such possible match then return it.
- If there are no marked sub-expressions, then all the remaining alternatives are indistinguishable; return the first of these found.
- Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only on such match possible then return it.
- Find the match which has the longest match for the first sub-expression, along with any ties. If there is only one such match then return it.
- Repeat steps 4 and 5 for each additional marked sub-expression.
- If there is still more than one possible match remaining, then they are indistinguishable; return the first one found.
Marked sub-expression as mentioned in the text refers to ()
capturing groups. Note that they only does capturing and back-reference is not supported.
Therefore, in order to do a lazy matching, you need to construct a regular expression, such that it matches the repeated part, while avoid matching the tail part until the very end. Since ERE and BRE are equivalent to theoretical regular expression, as long as you can construct a DFA, there exists an equivalent regex that does the trick (just that constructing it is not trivial task in some cases).
For your requirement, this regex shall work:
AB([^B]|B+[^AB])*B*BA
The part ([^B]|B+[^AB])*B*
matches any string that does not contain the string "BA".
Derivation
This is the DFA for matching a string that does not contain the string "BA".

The notation here is not standard, so I will explain a bit:
- State q1/B means that the state is named q1 (just like how you name a variable), B is the current progress towards matching BA.
*
means any character in the alphabet. [^B]
means any character in the alphabet except for B.
In the DFA, q0 and q1 are final states, q0 is the initial state. Note that q2 is a trap state, since it is a non-final state, and there is no transition out of this state.
Use the steps here, or just use JFLAP to derive the regular expression. (In JFLAP, you should use some character, such as C
to represent [^AB]
).
Since q2 is a trap state, we can exclude it from the formula:
R0 = [^B]R0 + BR1 + λ
R1 = [^AB]R0 + BR1 + λ
Apply Arden's theorem to R1:
R1 = B*([^AB]R0 + λ)
Substitute R1 to R0:
R0 = [^B]R0 + BB*([^AB]R0 + λ) + λ
Distribute BB*
over ([^AB]R0 + λ)
:
R0 = [^B]R0 + BB*[^AB]R0 + BB*λ + λ
Group together:
R0 = ([^B] + BB*[^AB])R0 + (BB* + λ)
Apply Arden's theorem to R0:
R0 = ([^B] + BB*[^AB])*(BB* + λ)
(BB*
OR λ
(empty string)) is equivalent to B*
:
R0 = ([^B] + BB*[^AB])*B*
Let use rewrite it into awk
's syntax: ([^B]|B+[^AB])*B*
, which is what shown above.