I'm having trouble understanding the finer details of negative lookahead regular expressions. After reading Regex lookahead, lookbehind and atomic groups, I thought I had a good summary of negative lookaheads when I found this description:
(?!REGEX_1)REGEX_2
Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.
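To make that description concrete, here is a minimal sketch using Python's re module. The helper name first_match and the sample strings are illustrative assumptions, not part of the quoted description. It shows that the lookahead consumes no characters, and that when it rejects one position the engine simply retries at the next one.

```python
import re

def first_match(pattern, text):
    """Return (start, matched_text) for the first match, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return (m.start(), m.group(0)) if m else None

# At position 0 the lookahead sees 'foo' and rejects; the engine retries at
# position 1, where the next three characters are 'oob', so \w+ starts there.
print(first_match(r'(?!foo)\w+', 'foobar'))   # (1, 'oobar')

# The lookahead only inspects text at the current position, so a later 'foo'
# does not prevent a match that starts at position 0.
print(first_match(r'(?!foo)\w+', 'barfoo'))   # (0, 'barfoo')

# REGEX_2 starts at the same position the lookahead checked, so this can never match.
print(first_match(r'(?!foo)foo', 'foo'))      # None
```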
Hoping I understood the algorithm, I cooked up a two-sentence test insult; I wanted to find the sentence without a certain word. Specifically...
Insult: 'Yomama is ugly. And, she smells like a wet dog.'
Requirements:
- Test 1: Return a sentence without 'ugly'.
- Test 2: Return a sentence without 'looks'.
- Test 3: Return a sentence without 'smells'.
I assigned the test words to $arg, and I used (?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.) to implement the test.
- (?![A-Z].*?$arg.*?\.) is a negative lookahead to reject a sentence with the test word
- ([A-Z].*?\.) matches at least one sentence.
The critical piece seems to be in understanding where the regex engine starts matching after processing the negative lookahead.
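One way to inspect this is to test the inside of the lookahead separately at every position where a sentence could start. The sketch below is only an illustration: the helper lookahead_report is a hypothetical name, and re.escape is added as a precaution that the original pattern does not use.

```python
import re

INSULT = 'Yomama is ugly. And, she smells like a wet dog.'

def lookahead_report(arg, text):
    """For each capital letter, report (position, blocked), where blocked is True
    when the pattern inside the negative lookahead matches from that position."""
    inner = re.compile(r'[A-Z].*?%s.*?\.' % re.escape(arg))
    report = []
    for m in re.finditer(r'[A-Z]', text):
        pos = m.start()
        # Pattern.match(text, pos) tries the pattern starting exactly at pos.
        report.append((pos, inner.match(text, pos) is not None))
    return report

print(lookahead_report('ugly', INSULT))    # [(0, True), (16, False)]
print(lookahead_report('looks', INSULT))   # [(0, False), (16, False)]
print(lookahead_report('smells', INSULT))  # [(0, True), (16, True)]
```

A position marked True is one where the negative lookahead fails, so the capturing group never gets a chance to run there.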
Expected Results:
- Test 1 ($arg = "ugly"): "And, she smells like a wet dog."
- Test 2 ($arg = "looks"): "Yomama is ugly."
- Test 3 ($arg = "smells"): "Yomama is ugly."
Actual Results:
- Test 1 ($arg = "ugly"): "And, she smells like a wet dog." (Success)
- Test 2 ($arg = "looks"): "Yomama is ugly." (Success)
- Test 3 ($arg = "smells"): Failed, no match
At first I thought Test 3 failed because ([A-Z].*?\.) was too greedy and matched both sentences; however, (?:(?![A-Z].*?$arg.*?\.))([A-Z][^\.]*?\.) didn't work either. Next I wondered whether there was a problem with the Python negative lookahead implementation, but Perl gave me exactly the same result.
Finally I found the solution: I had to reject periods in the .*? portions of the expression by using [^\.]*? instead, so this regex works: (?:(?![A-Z][^\.]*?$arg[^\.]*?\.))([A-Z][^\.]*?\.)
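For reference, a minimal Python sketch of the period-excluding version. The function name sentence_without is hypothetical, the redundant outer (?:...) is dropped, [^.] is used as an equivalent spelling of [^\.], and re.escape is an added precaution rather than part of the original regex.

```python
import re

INSULT = 'Yomama is ugly. And, she smells like a wet dog.'

def sentence_without(arg, text):
    """Return the first sentence that does not contain arg, or None (hypothetical helper)."""
    word = re.escape(arg)
    # [^.]*? cannot cross a period, so the lookahead is confined to one sentence.
    pattern = r'(?![A-Z][^.]*?%s[^.]*?\.)([A-Z][^.]*?\.)' % word
    m = re.search(pattern, text)
    return m.group(1) if m else None

print(sentence_without('ugly', INSULT))    # And, she smells like a wet dog.
print(sentence_without('looks', INSULT))   # Yomama is ugly.
print(sentence_without('smells', INSULT))  # Yomama is ugly.
```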
Question
However, I have another concern: "Yomama is ugly." does not have "smells" in it. So, if .*? is supposed to be a non-greedy match, why can't I complete Test 3 with (?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)?
EDIT
In light of @bvr's excellent suggestion to use -Mre=debug, I will consider this some more after work. It certainly looks like Seth's description is accurate at this point. What I have learned so far is that the expression inside a negative lookahead will match whenever possible, even if I put non-greedy .*? operators in the NLA.
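A small Python check of that observation, as a sketch: the variable names lazy and bounded are my own. It runs the inside of the lookahead as an ordinary pattern anchored at the first sentence, once with .*? and once with [^.]*?.

```python
import re

INSULT = 'Yomama is ugly. And, she smells like a wet dog.'

# Lazy .*? prefers the shortest expansion, but it still grows past the first
# period when that is the only way for the whole pattern to succeed.
lazy = re.match(r'[A-Z].*?smells.*?\.', INSULT)

# [^.]*? is not allowed to cross a period, so the attempt from 'Y' fails.
bounded = re.match(r'[A-Z][^.]*?smells[^.]*?\.', INSULT)

print(lazy.group(0) == INSULT)  # True: the match spans both sentences
print(bounded)                  # None
```

In other words, non-greedy changes the order in which alternatives are tried, not which strings the pattern is able to match.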
Python Implementation
import re

def test_re(arg, INSULTSTR):
    mm = re.search(r'''
        (?:                    # Non-capturing group
          (?![A-Z].*?%s.*?\.)) # Negative zero-width assertion: capital
                               # letter, then arg, then a period
        ([A-Z].*?\.)           # Capture from a capital letter up to the next period
        ''' % arg, INSULTSTR, re.VERBOSE)
    if mm is not None:
        print("neg-lookahead(%s) MATCHED: '%s'" % (arg, mm.group(1)))
    else:
        print("Unable to match: neg-lookahead(%s) in '%s'" % (arg, INSULTSTR))

INSULT = 'Yomama is ugly. And, she smells like a wet dog.'
test_re('ugly', INSULT)
test_re('looks', INSULT)
test_re('smells', INSULT)
Perl Implementation
#!/usr/bin/perl
use strict;
use warnings;

sub test_re {
    my ($arg, $INSULTSTR) = @_;
    # Test the match itself; $1 is only meaningful after a successful match
    if ($INSULTSTR =~ /(?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)/) {
        print "neg-lookahead($arg) MATCHED: '$1'\n";
    } else {
        print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n";
    }
}

my $INSULT = 'Yomama is ugly. And, she smells like a wet dog.';
test_re('ugly', $INSULT);
test_re('looks', $INSULT);
test_re('smells', $INSULT);
Output
neg-lookahead(ugly) MATCHED: 'And, she smells like a wet dog.'
neg-lookahead(looks) MATCHED: 'Yomama is ugly.'
Unable to match: neg-lookahead(smells) in 'Yomama is ugly. And, she smells like a wet dog.'