
I am trying to figure out why a regex with a negative lookahead fails when the "single line" option is turned on.

Example (simplified):

<source>Test 1</source>
<source>Test 2</source>
<target>Result 2</target>
<source>Test 3</source>

This:

<source>(?!.*<source>)(.*?)</source>(?!\s*<target)

will fail if the single line option is on, and will work if the single line option is off. For instance, this works (disables the single line option):

(?-s:<source>(?!.*<source>)(.*?)</source>(?!\s*<target))

My understanding is that the single line mode simply allows the dot "." to match new lines, and I don't see why it would affect the expression above.

Can anyone explain what I am missing here?

::::::::::::::::::::::

EDIT: (?!.*) is a negative lookahead, not a capturing group.

<source>(?!.*?<source>)(.*?)</source>(?!\s*<target)

will ALSO FAIL if the single line mode is on, so it doesn't look like this is a greediness issue. Try it in a Regex designer (like Expresso or Rad regex):

With single line OFF, it matches (as expected):

<source>Test 1</source>    
<source>Test 3</source>

With single line ON:

<source>Test 3</source>

I don't understand why it doesn't match the first one as well: its content does not contain <source>, so the first negative lookahead should succeed and the expression should match.

Sylverdrag
  • Do yourself a favor by parsing this using an HTML parser instead of regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Amarghosh Jun 01 '10 at 08:18
  • Waiting for a comment from John Saunders: 3...2...1... – Tim Pietzcker Jun 01 '10 at 08:33
  • @Amarghosh. Not relevant in my context. Yes, there are contexts where using regex IS the thing to do. – Sylverdrag Jun 01 '10 at 09:42

2 Answers


It "fails" because the negative lookahead seems to be misplaced.

<source>(?!.*<source>)(.*?)</source>(?!\s*<target)
        ^^^^^^^^^^^^^^

Now, let's consider what (?!.*<source>) does here: it's a lookahead that says that there is NO match for .*<source> from that position.

Well, in single-line mode, . matches everything, including line breaks. After each of the first two <source> tags, .*<source> DOES in fact match, because .* can run across the following lines and reach a later <source>. So the negative lookahead fails for the first two <source>.

On the last <source>, .*<source> no longer matches, so the negative lookahead succeeds. The rest of the pattern also succeeds, and that's why you only get <source>Test 3</source> in single-line mode.
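
To see this concretely, here is a minimal sketch that runs the question's pattern in both modes. It assumes Python's re module as a stand-in for the engine used in the question, with re.DOTALL playing the role of the "single line" option; the variable names are only illustrative.

```python
import re

# Sample input from the question.
text = (
    "<source>Test 1</source>\n"
    "<source>Test 2</source>\n"
    "<target>Result 2</target>\n"
    "<source>Test 3</source>\n"
)

# The pattern from the question, unchanged.
pattern = r"<source>(?!.*<source>)(.*?)</source>(?!\s*<target)"

# Default mode: "." stops at line breaks, so the first lookahead
# only scans the rest of the current line.
single_line_off = re.findall(pattern, text)

# re.DOTALL (assumed equivalent of "single line"): "." also matches "\n",
# so the first lookahead scans all the way to the end of the input.
single_line_on = re.findall(pattern, text, flags=re.DOTALL)

print(single_line_off)  # ['Test 1', 'Test 3']
print(single_line_on)   # ['Test 3']
```

With the flag on, the first lookahead finds a later <source> after both Test 1 and Test 2, so only the last element survives.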

polygenelubricants

I believe this is what you're looking for:

<source>((?:(?!</?source>).)*)</source>(?!\s*<target)

The idea is that you match each character one at a time, but only after making sure it isn't the first character of <source> or </source>. Also, with the addition of /? to the lookahead, you don't have to use a non-greedy quantifier.
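
As a quick check, the sketch below runs this pattern with and without dot-matches-newline. It again assumes Python's re module and re.DOTALL as stand-ins for the original engine and its "single line" option; the variable names are hypothetical.

```python
import re

# Sample input from the question.
text = (
    "<source>Test 1</source>\n"
    "<source>Test 2</source>\n"
    "<target>Result 2</target>\n"
    "<source>Test 3</source>\n"
)

# Each "." is consumed only if no <source> or </source> tag starts at
# that position, so the capture can never run past the closing tag.
pattern = r"<source>((?:(?!</?source>).)*)</source>(?!\s*<target)"

matches_default = re.findall(pattern, text)
matches_dotall = re.findall(pattern, text, flags=re.DOTALL)

print(matches_default)  # ['Test 1', 'Test 3']
print(matches_dotall)   # ['Test 1', 'Test 3']
```

Because the lookahead is checked before every character rather than once after the opening tag, the result no longer depends on whether . can cross line breaks.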

Alan Moore