
Suppose we want to match every occurrence of one between <out>...</out> in this text (with the dot-matches-all option on):

<out>hello!</out>
<nx1>home one</nx1>
<nx2>living</nx2>
<out>one text
text one continues 
and at last here ends one</out>
<m2>dog one</m2>
<out>bye!</out>

Let's say we use this pattern:

one(?=(?:(?!<out>).)*</out>)

I would really appreciate it if someone could explain how the regex engine processes that pattern step by step, and where it is (the position in the original text) at each phase of processing (something like @Tim Pietzcker's accepted, helpful explanation for this question: Regex - lookahead assertion).

wiki

2 Answers


Many tools exist to automatically explain what your regex does, character by character.

The idea behind it, though, is to check that one is followed by </out> without entering a new out tag along the way: if a </out> follows and no new <out>...</out> structure begins before it, we know we are already inside one.

So the regex will match one if it is followed by </out> and if there's no <out> between the two.
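As a quick check of that rule, here is a minimal Python sketch that runs the pattern against the sample text from the question. It assumes Python's re module, with re.DOTALL standing in for the dot-matches-all option; the names TEXT, PATTERN and line_of are only illustrative.

```python
import re

# Sample text from the question (line 1 is "<out>hello!</out>").
TEXT = """<out>hello!</out>
<nx1>home one</nx1>
<nx2>living</nx2>
<out>one text
text one continues
and at last here ends one</out>
<m2>dog one</m2>
<out>bye!</out>
"""

# re.DOTALL lets "." match line breaks as well.
PATTERN = re.compile(r"one(?=(?:(?!<out>).)*</out>)", re.DOTALL)


def line_of(text, pos):
    """Return the 1-based line number that contains index pos."""
    return text.count("\n", 0, pos) + 1


all_lines = [line_of(TEXT, m.start()) for m in re.finditer("one", TEXT)]
matched_lines = [line_of(TEXT, m.start()) for m in PATTERN.finditer(TEXT)]

print(all_lines)      # every "one" in the text: [2, 4, 5, 6, 7]
print(matched_lines)  # only those inside <out>...</out>: [4, 5, 6]
```

The occurrences on lines 2 and 7 are rejected because a new <out> shows up before any </out> does.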

The work is done by (?:(?!<out>).)*: the . matches a character only if that character is not the < that begins <out>. So we can reach </out> only by consuming characters that are not a < followed by out>.
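To see where the engine is at each phase, the lookahead can be imitated by hand. The following is a sketch in plain Python of what the loop and the backtracking amount to, not what a real regex engine literally executes; the names lookahead_trace and find_ones are made up for illustration.

```python
import re

TEXT = """<out>hello!</out>
<nx1>home one</nx1>
<nx2>living</nx2>
<out>one text
text one continues
and at last here ends one</out>
<m2>dog one</m2>
<out>bye!</out>
"""

OPEN, CLOSE = "<out>", "</out>"


def lookahead_trace(text, start):
    """Imitate (?=(?:(?!<out>).)*</out>) attempted at index start.

    Returns (stop, close): where the greedy loop stopped, and the index at
    which '</out>' was finally matched (None if the lookahead fails).
    """
    pos = start
    # Greedy loop: before each character, (?!<out>) checks that '<out>' does
    # not begin here; only then does '.' consume that one character.
    while pos < len(text) and not text.startswith(OPEN, pos):
        pos += 1
    stop = pos
    # Backtracking: give characters back one at a time until '</out>' matches.
    while pos >= start:
        if text.startswith(CLOSE, pos):
            return stop, pos
        pos -= 1
    return stop, None


def find_ones(text):
    """Start indexes of every 'one' whose lookahead succeeds."""
    hits = []
    i = text.find("one")
    while i != -1:
        if lookahead_trace(text, i + len("one"))[1] is not None:
            hits.append(i)
        i = text.find("one", i + 1)
    return hits


for i in find_ones(TEXT):
    stop, close = lookahead_trace(TEXT, i + len("one"))
    print(f"'one' at {i}: loop stopped at {stop}, '</out>' matched at {close}")

# Cross-check against the real pattern.
regex_hits = [m.start() for m in
              re.finditer(r"one(?=(?:(?!<out>).)*</out>)", TEXT, re.DOTALL)]
```

For the one on line 4, the loop first runs past the closing tag all the way to <out>bye! on line 8, then backs up until </out> at the end of line 6 matches. Line breaks are consumed by . like any other character because dot-matches-all is on.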


A speed improvement would be:

one(?=(?:[^<]++|<(?!/?out>))*+</out>)

Stepping into the negative lookahead at every character greatly increases the cost of matching that character. Here [^<]++ jumps straight to the next suspicious <, and the negative lookahead runs only when it has to. The check is (?!/?out>) rather than (?!out>) because a possessive loop never gives characters back, so it has to stop before the < of </out> as well as before <out>.
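The same idea, skipping whole runs of non-< characters and testing the lookahead only at a <, can be tried in Python. Python's re module only accepts possessive quantifiers from version 3.11 on, so this sketch swaps in an unrolled loop that needs none; that substitution, and the names SLOW, FAST, SAMPLES and starts, are assumptions of the sketch rather than part of the answer.

```python
import re

# Original pattern: the lookahead runs before every single character.
SLOW = re.compile(r"one(?=(?:(?!<out>).)*</out>)", re.DOTALL)

# Unrolled variant: [^<]* skips to the next "<", and the lookahead runs only
# there. (?!/?out>) makes the loop stop before "</out>" as well as "<out>".
FAST = re.compile(r"one(?=[^<]*(?:<(?!/?out>)[^<]*)*</out>)")

SAMPLES = [
    "<out>one</out> one <out>x</out>",
    "<a>one</a><out>one <b>x</b> one</out><c>one</c>",
    "one with no tags at all",
]


def starts(pattern, text):
    """Start index of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


for sample in SAMPLES:
    # Both patterns should report the same positions.
    print(starts(SLOW, sample), starts(FAST, sample))
```

Note that the lookahead in FAST is (?!/?out>), so the loop stops before the < of </out> as well as before <out>. The second sample has other tags inside the <out> block, which is the case where that matters.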

Robin
  • thanks for your response; I didn't understand the role of `.` and `*` in `(?:(?!<out>).)*`; so the engine checks 5 characters (the length of `<out>`) from where it is right now (just after `one` on line 4), sees that they are not `<out>` (they are " text"), and then comes `.`: which is a line break??? – wiki May 29 '14 at 09:12
  • @wiki: Don't forget that lookaheads are [zero width](http://www.regular-expressions.info/lookaround.html) and don't consume characters! `(?!<out>).` on `abc` will match `a`: we start before the first character, and `(?!<out>)` checks that the first 5 characters aren't `<out>`. They aren't, so the regex goes on (we're still before the `a`) and `.` matches `a`. When the pattern is repeated with `*`, the regex performs the lookahead check at each character before matching it. Is it clearer? – Robin May 29 '14 at 09:19

Here's the explanation taken from here:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  one                      'one'
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        <out>                    '<out>'
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      .                        any character except \n
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
    </out>                   '</out>'
--------------------------------------------------------------------------------
  )                        end of look-ahead
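The same breakdown can be kept next to the pattern itself. Below is a small sketch using Python's re.VERBOSE flag, which allows whitespace and comments inside a pattern. The table above describes . without the dot-matches-all option; the sketch sets re.DOTALL as the question does, so . also matches line breaks. The name VERBOSE_PATTERN and the test string are only illustrative.

```python
import re

VERBOSE_PATTERN = re.compile(r"""
    one                 # the literal text 'one'
    (?=                 # look ahead (consuming nothing) to see if there is:
        (?:             #   a group, repeated 0 or more times (greedy):
            (?!<out>)   #     not at the start of '<out>'
            .           #     any character (newlines too, because of DOTALL)
        )*
        </out>          #   followed by the literal '</out>'
    )
""", re.VERBOSE | re.DOTALL)

sample = "<out>one</out> one <out>x</out>"
found = [m.start() for m in VERBOSE_PATTERN.finditer(sample)]
print(found)  # only the 'one' inside the first <out>...</out>: [5]
```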
Amit Joki
  • thanks, but RegexBuddy can do much better than that! I know that `(?!` is a negative lookahead assertion, but what I want to know is its function in the above-mentioned pattern; check @Tim Pietzcker's explanation – wiki May 29 '14 at 08:57