1

Given the following text:

My name is foo.

My name is bar.

With the goal being to return each line which contains or does not contain a particular substring, both of the following positive and negative regex patterns can be used to return the same result:

Positive lookahead: ^(?=.*bar).*$ returns My name is bar.

Negative lookahead: ^((?!foo).)*$ returns My name is bar.

However, why does the negative lookahead need to be nested within multiple sets of parentheses, with the . and the * quantifier separated by the parentheses, whereas in the positive lookahead they can sit next to each other as .* ?

singularity

2 Answers

3

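The construct where the negative lookahead is nested inside a group together with the ., and the * quantifier is applied to the whole group, is called a tempered greedy token. The nesting is what makes the lookahead run again before every character that . consumes. You do not have to use it in this scenario.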
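To see why the lookahead sits inside the repeated group, it may help to compare the two placements side by side. The following is only a minimal sketch: the third sample line and the variable names are made up for illustration and are not part of the question.

```python
import re

# The first two lines come from the question; the third is a hypothetical extra.
lines = ["My name is foo.", "My name is bar.", "foo is my name."]

# Lookahead checked once at the start: it only rejects lines that BEGIN with "foo".
start_only = re.compile(r"^(?!foo).*$")

# Tempered greedy token: the lookahead is re-checked before every character consumed.
tempered = re.compile(r"^((?!foo).)*$")

start_only_hits = [s for s in lines if start_only.search(s)]
tempered_hits = [s for s in lines if tempered.search(s)]

print(start_only_hits)  # ['My name is foo.', 'My name is bar.']
print(tempered_hits)    # ['My name is bar.']
```

With the lookahead outside the group, "My name is foo." still gets through because the check only happens at position 0.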

You can use a normal lookahead anchored at the start instead of the tempered greedy token:

^(?!.*foo).*$

See the regex demo

Here,

  • ^ - matches the location at the start of the string
  • (?!.*foo) - a negative lookahead failing the match if there is foo somewhere on the line (or string if DOTALL mode is on)
  • .*$ - any 0+ characters (other than a newline if DOTALL mode is off) up to the end of the string/line.
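As a rough illustration of the breakdown above, here is how the anchored pattern could be applied line by line in Python. This is a sketch under assumptions: the sample text is the question's two lines joined by a single newline, and re.MULTILINE is used so that ^ and $ match at line boundaries.

```python
import re

text = "My name is foo.\nMy name is bar."

# re.MULTILINE lets ^ and $ match at the start and end of each line.
# The lookahead runs once per line, right after ^.
no_foo = re.compile(r"^(?!.*foo).*$", re.MULTILINE)

matches = no_foo.findall(text)
print(matches)  # ['My name is bar.']
```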

What to use?

The tempered greedy token is usually much less efficient, because its lookahead is executed at every position in the string. Use the lookahead anchored at the start when you just need to check whether a string contains something or not. However, the tempered greedy token might be required in some cases. See When to Use this Technique.
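One situation where the tempered greedy token seems hard to avoid is matching from an opening delimiter to a closing one without running across a second opening delimiter. The example below is a hypothetical sketch; the <b> markup and variable names are made up and are not taken from the linked article.

```python
import re

# Hypothetical input: the first <b> is never closed before the second one opens.
text = "<b>one <b>two</b>"

# A plain lazy dot happily runs across the second <b>.
lazy = re.findall(r"<b>(.*?)</b>", text)

# The tempered token refuses to consume any character that starts another <b>.
tempered = re.findall(r"<b>((?:(?!<b>).)*?)</b>", text)

print(lazy)      # ['one <b>two']
print(tempered)  # ['two']
```

An anchored lookahead cannot express this, since the restriction applies only to the stretch between the two delimiters rather than to the whole line.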

Wiktor Stribiżew
  • Thanks for the additional information about tempered greedy tokens - I didn't know that's what I was doing. – singularity Apr 06 '16 at 19:34
  • Actually, I am rather disappointed at the fact that the best answer to the famous [Regular expression to match line that doesn't contain a word?](http://stackoverflow.com/questions/406230/regular-expression-to-match-line-that-doesnt-contain-a-word) dwells on this tempered greedy token. It is not the best solution for this task. The bottleneck is the unanchored lookahead that is executed at *every location* in the string, while in my answer the lookahead is triggered right at the start and is executed just once. That is enough to see whether a string has the word or not. – Wiktor Stribiżew Apr 06 '16 at 19:39
  • Nice ref! That's the very answer that confused me! – singularity Apr 06 '16 at 19:46
1

Example

Given the string text = 'a123b456c', suppose we want to use the substring '123' as an anchor:

(?=123)  Positive lookahead:   asserts that '123' comes right *after* the current position
(?<=123) Positive lookbehind:  asserts that '123' comes right *before* the current position
(?!123)  Negative lookahead:   asserts that '123' does not come right *after* the current position
(?<!123) Negative lookbehind:  asserts that '123' does not come right *before* the current position

'123' is only used as an anchor and is not replaced. To see how this works:

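import re

text = 'a123b456c'

# Each call returns a new string; the lookaround itself never consumes '123'.
re.sub(r'a(?=123)', '@', text)   # returns '@123b456c': 'a' is followed by '123', which is not replaced
re.sub(r'(?<=123)b', '@', text)  # returns 'a123@456c': 'b' is preceded by '123'
re.sub(r'b(?!123)', '@', text)   # returns 'a123@456c': 'b' is followed by '456', not '123'
re.sub(r'(?<!123)c', '@', text)  # returns 'a123b456@': 'c' is preceded by '456', not '123'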

Hope this helps

Yi Xiang Chong