
I have the following text:

aabbaa
aa bbc aa

bbg

aa           bbd   aa

I would like to find words that start with bb and are not between aa and aa, regardless of any whitespace preceding or following the matching word, using PCRE. In the above example only bbg should be matched.

I have created the following pattern:

(?<!aa)bb(\w)*(?!aa)

However, only aabbaa is excluded and the others still match. I don't know how I can use \s* inside a negative lookahead/lookbehind to get the desired result. It seems it cannot simply be done using:

(?<!aa\s*)bb(\w)*(?!\s*aa)

How can it be done?

Marcin Nabiałek

1 Answer


(*SKIP)(*F) Magic (No Lookaheads Needed)

Use this:

(\baa\b).*?\1(*SKIP)(*F)|\bbb\w+\b

See the match in the demo.

This problem is a classic case of the technique explained in this question on how to "regex-match a pattern, excluding..."

The left side of the alternation | matches complete aa ... aa strings, then deliberately fails with (*F). Because of (*SKIP), the engine does not retry inside that text but resumes at the position where the aa ... aa match ended. The right side matches the bb... words you want, and we know they are the right ones because they were not matched by the expression on the left.
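For readers whose engine lacks backtracking control verbs, here is a minimal sketch of the same "match what you don't want, keep what you do" idea. It assumes Python's standard `re` module, which does not support (*SKIP)(*F), so it swaps in a different mechanism: the unwanted aa ... aa context is still consumed by the left side of the alternation, and the wanted word is placed in a capture group on the right. The function name `find_bb_words` and the `sample` string are illustrative only and not part of the original answer.

```python
import re

# Left side: consume a whole "aa ... aa" span (nothing is kept from it).
# Right side: capture a word starting with "bb" into group 2.
# This is a capture-group emulation, not the PCRE (*SKIP)(*F) verbs themselves.
PATTERN = re.compile(r"(\baa\b).*?\1|(\bbb\w+\b)")


def find_bb_words(text):
    """Return bb-words that were not consumed as part of an aa ... aa span."""
    return [m.group(2) for m in PATTERN.finditer(text) if m.group(2) is not None]


# The sample text from the question.
sample = "aabbaa\naa bbc aa\n\nbbg\n\naa           bbd   aa\n"
print(find_bb_words(sample))
```

The design difference is that (*SKIP)(*F) lets the overall match itself be the wanted word, while this variant returns overall matches for both sides and relies on checking which group participated.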


zx81
  • Why is there literally no mention of these features in the documentation...? I could've used that years ago! – Niet the Dark Absol Jul 09 '14 at 12:12
  • At the moment I have no idea how it works but I'll try to understand it soon. Thank you – Marcin Nabiałek Jul 09 '14 at 12:14
  • Marcin, the linked article explains it in great detail. It's a beautiful and simple technique. :) – zx81 Jul 09 '14 at 12:16
  • @NiettheDarkAbsol Here's the [Perl doc](http://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs) on the subject, there's also a bit in the `PCRE` doc but not as much detail. :) – zx81 Jul 09 '14 at 12:17
  • Marcin, the key to understand is that there are two sides of the `|`... The left side is used to SKIP what you don't want, i.e. the `aa..aa` stuff... Then on the right you can freely match `bbetc` because any bad context has already been neutralised. It's simple and powerful. :) – zx81 Jul 09 '14 at 12:19