Negative Lookahead Regex greed (why is .*? too greedy)

Question

I'm having trouble understanding the finer details of negative lookahead regular expressions. After reading Regex lookahead, lookbehind and atomic groups, I thought I had a good summary of negative lookaheads when I found this description:

(?!REGEX_1)REGEX_2

Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.

Hoping I understood the algorithm, I cooked up a two-sentence test insult; I wanted to find the sentence without a certain word. Specifically...

Insult: 'Yomama is ugly. And, she smells like a wet dog.'

Requirements:

Test 1: Return a sentence without 'ugly'.

Test 2: Return a sentence without 'looks'.

Test 3: Return a sentence without 'smells'.

I assigned the test words to $arg, and I used (?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.) to implement the test.

(?![A-Z].*?$arg.*?\.) is a negative lookahead to reject a sentence with the test word
([A-Z].*?\.) matches at least one sentence.

The critical piece seems to be in understanding where the regex engine starts matching after processing the negative lookahead.

Expected Results:

Test 1 ($arg = "ugly"): "And, she smells like a wet dog."

Test 2 ($arg = "looks"): "Yomama is ugly."

Test 3 ($arg = "smells"): "Yomama is ugly."

Actual Results:

Test 1 ($arg = "ugly"): "And, she smells like a wet dog." (Success)

Test 2 ($arg = "looks"): "Yomama is ugly." (Success)

Test 3 ($arg = "smells"): Failed, no match

At first I thought Test 3 failed because ([A-Z].*?\.) was too greedy and matched both sentences; however, (?:(?![A-Z].*?$arg.*?\.))([A-Z][^\.]*?\.) didn't work either. Next I wondered whether there was a problem with the python negative lookahead implementation, but perl gave me exactly the same result.

Finally I found the solution, I had to reject periods in my .*? portion of the expressions by using [^\.]*?; so this regex works: (?:(?![A-Z][^\.]*?$arg[^\.]*?\.))([A-Z][^\.]*?\.)

Question

However, I have another concern; "Yomama is ugly." does not have "smells" in it. So, if .*? is supposed to be a non-greedy match, why can't I complete Test 3 with (?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)?

EDIT

In light of @bvr's excellent suggestion to use -Mre=debug, I will consider this some more after work. It certainly looks like Seth's description is accurate at this point. What I learned so far is that negative lookahead expressions will match whenever possible, even if I put non-greedy .*? operators in the NLA.

Python Implementation

import re

def test_re(arg, INSULTSTR):
    mm = re.search(r'''
        (?:                  # No grouping
        (?![A-Z].*?%s.*?\.)) # Negative zero-width
                             #     assertion: arg, followed by a period
        ([A-Z].*?\.)         # Match a capital letter followed by a period
        ''' % arg, INSULTSTR, re.VERBOSE)
    if mm is not None:
        print "neg-lookahead(%s) MATCHED: '%s'" % (arg, mm.group(1))
    else:
        print "Unable to match: neg-lookahead(%s) in '%s'" % (arg, INSULTSTR)


INSULT = 'Yomama is ugly.  And, she smells like a wet dog.'
test_re('ugly', INSULT)
test_re('looks', INSULT)
test_re('smells', INSULT)

Perl Implementation

#!/usr/bin/perl

sub test_re {
    $arg    = $_[0];
    $INSULTSTR = $_[1];
    $INSULTSTR =~ /(?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)/;
    if ($1) {
        print "neg-lookahead($arg) MATCHED: '$1'\n";
    } else {
        print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n";
    }
}

$INSULT = 'Yomama is ugly.  And, she smells like a wet dog.';
test_re('ugly', $INSULT);
test_re('looks', $INSULT);
test_re('smells', $INSULT);

Output

neg-lookahead(ugly) MATCHED: 'And, she smells like a wet dog.'
neg-lookahead(looks) MATCHED: 'Yomama is ugly.'
Unable to match: neg-lookahead(smells) in 'Yomama is ugly.  And, she smells like a wet dog.'

Other failures: `test_re('Yomama',$INSULT);` and `test_re('And',$INSULT);` — Seth Robertson, May 25 '11 at 03:41
@Mike: Yes, you are getting matches, but they are bad matches. It is returning a sentence with the bad word in it. — Seth Robertson, May 25 '11 at 03:49
Regarding your negative lookahead, what's the point of everything following `$arg`? Seems to me that `(?![A-Z][^\.]*?$arg)` will fail just as well as what you have if `$arg` is encountered (failing being the desired behavior here). But I don't know Perl or Python. — harpo, May 25 '11 at 03:56
@harpo, I used `(?![A-Z].*?$arg.*?\.)` in the negative lookahead because I want to reject a single sentence with `$arg` in it; however, I wanted to be 100% sure that I avoided match into the second sentence when possible. Thus I explicitly matched the first period after `$arg` — Mike Pennington, May 25 '11 at 03:59
@harpo, that may be true, but my question is not so much what works (I almost got that right); it is why the original is too greedy in the face of `.*?` in the expressions. I need to think through Seth's latest response to see if that answers the question. — Mike Pennington, May 25 '11 at 04:06

score 3 · Accepted Answer · answered May 25 '11 at 03:44

3

#!/usr/bin/perl

sub test_re {
    $arg    = $_[0];
    $INSULTSTR = $_[1];
    $INSULTSTR =~ /(?:^|\.\s*)(?:(?![^.]*?$arg[^.]*\.))([^.]*\.)/;
    if ($1) {
        print "neg-lookahead($arg) MATCHED: '$1'\n";
    } else {
        print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n";
    }
}

$INSULT = 'Yomama is ugly.  And, she smells like an wet dog.';
test_re('Yomama', $INSULT);
test_re('ugly', $INSULT);
test_re('looks', $INSULT);
test_re('And', $INSULT);
test_re('And,', $INSULT);
test_re('smells', $INSULT);
test_re('dog', $INSULT);

Results:

neg-lookahead(Yomama) MATCHED: 'And, she smells like an wet dog.'
neg-lookahead(ugly) MATCHED: 'And, she smells like an wet dog.'
neg-lookahead(looks) MATCHED: 'Yomama is ugly.'
neg-lookahead(And) MATCHED: 'Yomama is ugly.'
neg-lookahead(And,) MATCHED: 'Yomama is ugly.'
neg-lookahead(smells) MATCHED: 'Yomama is ugly.'
neg-lookahead(dog) MATCHED: 'Yomama is ugly.'

answered May 25 '11 at 03:44

Seth Robertson

30,608
7
64
57

Seth, you certainly seem to have a better implementation; however, I'm trying to understand why my original code breaks WRT greediness. – Mike Pennington May 25 '11 at 03:53
2

My suspicion is that the problem is your asking for the second sentence to match by moving the pattern start location while still doing non-greedy matches. The `.*?` is allowed to match `.` so there is no particular reason for the re engine to try to move the pattern start space. If you change the `.*?` to `[^.]*` (greedy or not) the system will work (modula the first word match problem). – Seth Robertson May 25 '11 at 04:03
1

To help explain the problem try a pattern `[A-Z].*?smells.*?\.`. It matches the whole string. It should work with just `((?![A-Z][^\.]*?smells.*?\.)[A-Z].*?\.)` – James Khoury May 25 '11 at 04:13
@Mike: This might make what is happening more clear: `` perl -e '$_="Yomama is ugly. And, she smells like an wet dog."; $arg="smells"; print "Matched <$1>\n" if /([A-Z].*?$arg.*?\.)/;' `` The pattern start does not change and `.` is matched by `.*?`. – Seth Robertson May 25 '11 at 04:14

score 3 · Answer 2 · answered May 25 '11 at 04:55

3

If you're curious about what Perl is doing with a regex, you can run with the regex debugger:

perl -Dr -e '"A two. A one." =~ /(?![A-Z][^\.]*(?:two)[^\.]*\.)([A-Z][^\.]+\.)/; print ">$1<\n"'

which will generate much output for you to ponder. You will need a Perl built with -DDEBUGGING.

answered May 25 '11 at 04:55

Alex

5,863
2
29
46

3

Even perl without DEBUGGING can do it with `perl -Mre=debug -e ...` – bvr May 25 '11 at 09:26
@bvr, thank you for `-Mre=debug`; that option also allows me to run the script from-disk without `-e`, which is what I want in this case. – Mike Pennington May 25 '11 at 11:04

score 2 · Answer 3 · answered May 25 '11 at 04:19

Your problem is that the regex engine will try as hard as possible to match (?![A-Z].*?$arg.*?\.), so with the "smells" case, it ends up matching the whole string. (The period in the middle is then included in one of the .*? constructs.) You should restrict the negative lookahead case to match only as much as the other case can:

Instead of:

(?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)

Use:

(?:(?![A-Z][^.]*$arg[^.]*\.))([A-Z].*?\.)

Now, the negative lookahead cannot match more of the string than the other part can, since it must stop at the first period.