1

I am trying to match the contents between AB and BA using extended regex, for instance using awk.

Consider the two example strings AB12BABA and AB123BABA, I tried the following regex

AB([^B]|([^B][^A]|B[^A]|[^B]A))*BA

But it matches the whole string (greedy) for both examples.

Can anyone explain how the regex engine works for this case, and how I should change my regex so that it would work.

Håkon Hægland
  • 39,012
  • 21
  • 81
  • 174
  • adding question mark `*?` makes it lazy – revo Jan 12 '14 at 21:11
  • @revo Not available in `awk` I think? – Håkon Hægland Jan 12 '14 at 21:12
  • @BastiM No `*?` is not available, the tokens can only be used separately as `*` or `?`. The combination `*?` does not have any special interpreation in `awk` (like change meaning of `*` to become lazy).. – Håkon Hægland Jan 12 '14 at 21:36
  • @HåkonHægland sorry, in this case i haven't said anything *vanishes* – KeyNone Jan 12 '14 at 21:39
  • @TimPietzcker I hoped to get an explanation of how the regex engine works, the example in this question just happend to be similar to the other question.. The other question was about finding the regex. This question is about how the regex engine works for alternation combined with repetition.. – Håkon Hægland Jan 12 '14 at 21:39
  • It looked very much like the old, non-working regex. Why aren't you using the regex from the other solution? Doesn't it do what you wanted? Which part of the explanation was unclear? I can try to explain it better. – Tim Pietzcker Jan 12 '14 at 21:44
  • @TimPietzcker Thanks, but I think I will study your solution more closely and come back to you later. I am also trying to understand how I can extend the solution to three characters like `ABC.*CBA`.. So I need to understand it more in depth.. – Håkon Hægland Jan 12 '14 at 21:50
  • An extension is going to be very difficult with a POSIX ERE engine, especially if the delimiters get longer. Why does it have to be awk? – Tim Pietzcker Jan 12 '14 at 22:00

2 Answers2

3

The BRE and ERE engines will match with the Leftmost Longest Rule, which is different from how Perl and other NFA-based regex engines matches the regex.

The documentation from Boost library is more detailed in regards to the technical aspect, so I quote it here:

The Leftmost Longest Rule

Often there is more than one way of matching a regular expression at a particular location, for POSIX basic and extended regular expressions, the "best" match is determined as follows:

  1. Find the leftmost match, if there is only one match possible at this location then return it.
  2. Find the longest of the possible matches, along with any ties. If there is only one such possible match then return it.
  3. If there are no marked sub-expressions, then all the remaining alternatives are indistinguishable; return the first of these found.
  4. Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only on such match possible then return it.
  5. Find the match which has the longest match for the first sub-expression, along with any ties. If there is only one such match then return it.
  6. Repeat steps 4 and 5 for each additional marked sub-expression.
  7. If there is still more than one possible match remaining, then they are indistinguishable; return the first one found.

Marked sub-expression as mentioned in the text refers to () capturing groups. Note that they only does capturing and back-reference is not supported.


Therefore, in order to do a lazy matching, you need to construct a regular expression, such that it matches the repeated part, while avoid matching the tail part until the very end. Since ERE and BRE are equivalent to theoretical regular expression, as long as you can construct a DFA, there exists an equivalent regex that does the trick (just that constructing it is not trivial task in some cases).

For your requirement, this regex shall work:

AB([^B]|B+[^AB])*B*BA

The part ([^B]|B+[^AB])*B* matches any string that does not contain the string "BA".

Derivation

This is the DFA for matching a string that does not contain the string "BA".

DFA

The notation here is not standard, so I will explain a bit:

  • State q1/B means that the state is named q1 (just like how you name a variable), B is the current progress towards matching BA.
  • * means any character in the alphabet. [^B] means any character in the alphabet except for B.

In the DFA, q0 and q1 are final states, q0 is the initial state. Note that q2 is a trap state, since it is a non-final state, and there is no transition out of this state.

Use the steps here, or just use JFLAP to derive the regular expression. (In JFLAP, you should use some character, such as C to represent [^AB]).

Since q2 is a trap state, we can exclude it from the formula:

R0 =  [^B]R0 + BR1 + λ
R1 = [^AB]R0 + BR1 + λ

Apply Arden's theorem to R1:

R1 = B*([^AB]R0 + λ)

Substitute R1 to R0:

R0 = [^B]R0 + BB*([^AB]R0 + λ) + λ

Distribute BB* over ([^AB]R0 + λ):

R0 = [^B]R0 + BB*[^AB]R0 + BB*λ + λ

Group together:

R0 = ([^B] + BB*[^AB])R0 + (BB* + λ)

Apply Arden's theorem to R0:

R0 = ([^B] + BB*[^AB])*(BB* + λ)

(BB* OR λ (empty string)) is equivalent to B*:

R0 = ([^B] + BB*[^AB])*B*

Let use rewrite it into awk's syntax: ([^B]|B+[^AB])*B*, which is what shown above.

Community
  • 1
  • 1
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
  • Thanks! This is exactly what I was looking for :) Can you explain step 4 in more detail? And how it relates to `([^AB]|B|[^B]A)*`? – Håkon Hægland Jan 13 '14 at 13:58
  • @HåkonHægland: Since Tim Pietzcker's regex is incorrect, I am going to expand my answer and explain on how to get the answer. – nhahtdh Jan 14 '14 at 15:20
0

Use look arounds and a non greedy quantifier:

(?<=AB).*?(?=BA)

If you want to match the delimiters too, simply:

AB.*?BA
Bohemian
  • 412,405
  • 93
  • 575
  • 722