
I'm having trouble understanding negative lookbehind in regular expressions.

For a simple example, say I want to match all Gmail addresses that don't start with 'test'.

I have created an example on regex101 here.

My regular expression is:

(?<!test)\w+?\.?\w+@gmail\.com

So it matches things like:

hagrid@gmail.com
harry.potter@gmail.com

But it also matches things like

test@gmail.com

where the original string was

test@gmail.com

I thought the (?<!test) should exclude that match?

Not a real meerkat
liliest
  • avoid regular expressions, they're evil. – Pablo Recalde Jan 15 '20 at 09:46
  • @PabloRecalde no, regular expressions are not evil. They're a software tool, and, like any software tool, we must consider its strengths and flaws before using it. – Not a real meerkat Jan 15 '20 at 16:34
  • @PabloRecalde regexes are only "evil" when somebody who *doesn't* understand them tries to use them. Seeking to understand how regexes work is therefore the *opposite* of evil and should be encouraged. – VLAZ Jan 15 '20 at 19:50
  • Yes they are, lots and lots and lots of vulnerabilities come from their use, somebody who doesn't understand all the implications of using them and how their current regex engine works and is implemented means 95% of the programmer base. So, AVOID THEM. They're EVIL. Treat them as your last resort. – Pablo Recalde Jan 16 '20 at 10:10
  • @CássioRenan "*I would love to see the data to back it up.*" I've got it on good authority that 82% of all statistics is made up on the spot. – VLAZ Jan 17 '20 at 00:08
  • The point is that you want to fail a string starting with a certain pattern, and you need `^(?!pattern)...`, not `(?<!pattern)` – Wiktor Stribiżew Jan 21 '20 at 09:59

3 Answers


(?<!test)\w+?\.?\w+@gmail\.com works by looking behind from each candidate starting position before moving forward with the match.

test@gmail.com
^

At the point marked by the ^ (before the 0th character), the engine looks behind and doesn't see "test", so it can happily march forward and match "test@gmail.com", which is legal per what remains of the pattern \w+?\.?\w+@gmail\.com.
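
As a quick check of this reasoning, here is a minimal sketch that runs the original pattern against the problem string and reports where the match begins. It assumes a JavaScript runtime with lookbehind support (a recent Node.js, for example); the variable names are only illustrative.

```javascript
// Original pattern from the question, with the negative lookbehind.
const lookbehindPattern = /(?<!test)\w+?\.?\w+@gmail\.com/;

const result = lookbehindPattern.exec("test@gmail.com");

// The match begins at index 0, where nothing (so no "test") precedes it.
console.log(result[0], result.index); // test@gmail.com 0
```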

Using a negative lookahead with a word boundary fixes the problem:

\b(?!test)\w+?\.?\w+@gmail\.com

Consider our target again on the updated regex:

test@gmail.com
^

At this point, the engine is at a word boundary \b, looks ahead and sees "test" and cannot accept the string.

You may wonder if the \b boundary is necessary. It is, because removing it matches "est@gmail.com" from "test@gmail.com".

test@gmail.com
 ^

The engine fails to match "test@gmail.com" from the 0th character, but after stepping forward one character it matches "est@gmail.com" without a problem, which is not what the programmer intended.
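
The difference can be seen side by side in a small sketch, assuming the same single-address input; the names unanchored and anchored are just labels for this illustration.

```javascript
const input = "test@gmail.com";

// Negative lookahead without an anchor: the engine retries from index 1.
const unanchored = /(?!test)\w+?\.?\w+@gmail\.com/.exec(input);

// Same pattern with a leading word boundary: no starting position qualifies.
const anchored = /\b(?!test)\w+?\.?\w+@gmail\.com/.exec(input);

console.log(unanchored && unanchored[0]); // est@gmail.com
console.log(anchored); // null
```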

Demo of rejecting any email otherwise matching your format that begins with "test":

const s = `this is a short example hagrid@gmail.com of what I'm
trying to do with negative lookbehind test@gmail.com
harry.potter@gmail.com testasdf@gmail.com  @gmail.com 
a@gmail.com  asdftest@gmail.com`;
// Logs the match objects for hagrid@gmail.com, harry.potter@gmail.com and asdftest@gmail.com.
console.log([...s.matchAll(/\b(?!test)\w+?\.?\w+@gmail\.com/g)]);

Note that \w+?\.?\w+ enforces that if there is a period, it must be between \w+ substrings, but this approach rejects a (probably) valid email like "a@gmail.com" because it's only one letter. You might want \b(?!test)(?:\w+?\.?\w+|\w)@gmail\.com to rectify this.
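
A short sketch of that adjusted pattern, run on a made-up sample string (the sample and the variable names are assumptions for illustration only):

```javascript
const sample = "a@gmail.com test@gmail.com harry.potter@gmail.com";

// The alternation admits a one-character local part in addition to the original form.
const relaxedPattern = /\b(?!test)(?:\w+?\.?\w+|\w)@gmail\.com/g;

const found = sample.match(relaxedPattern);
console.log(found); // [ 'a@gmail.com', 'harry.potter@gmail.com' ]
```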

ggorlen

As the name suggests, the (?<! sequence is a negative lookbehind. The rest of the pattern matches only if the text immediately before it does not match the lookbehind. What counts as "before" is determined by where the match starts.

Let's start simple - we define a regex .cde. and try to match it against some input:

First nine letters are abcdefgeh
                        ^   ^
                        |   |
.cde. start -------------   |
.cde. end   -----------------

See on Regex101

So now we can see that the match is bcdef and is preceded by (among other characters) a. So, if we use that as a negative lookbehind (?<!a).cde. we will not get a match:

First nine letters are abcdefgeh
                       ^^   ^
                       ||   |
(?<!a)        ----------|   |
.cde. start   -----------   |
.cde. end     ---------------

See on Regex101

We could match the .cde. pattern but it's preceded by a which we don't want.

However, what happens if we define the negative lookbehind differently, as (?<!b).cde.:

First nine letters are abcdefgeh
                        ^   ^
                        |   |
.cde. start   -----------   |
.cde. end     ---------------

See on Regex101

We get a match for bcdef because there is no b immediately before this match, so the lookbehind is satisfied. Yes, b is the first character of the match, but it doesn't show up before it. And this is the core of lookarounds (lookbehinds and lookaheads): they are not included in the main match. They are zero-length assertions, since they are checked but consume no characters and don't appear in the match. In effect, they are evaluated at some position, and a lookbehind at the start of a pattern only inspects input that lies before the match.
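
The three .cde. variants can be compared directly. This is a minimal sketch assuming a runtime with lookbehind support; firstMatch is a hypothetical helper written for this illustration.

```javascript
const text = "First nine letters are abcdefgeh";

// Hypothetical helper: return the matched text, or null when nothing matches.
const firstMatch = (regex) => {
  const m = regex.exec(text);
  return m ? m[0] : null;
};

console.log(firstMatch(/.cde./));       // bcdef
console.log(firstMatch(/(?<!a).cde./)); // null, because "a" sits right before "b"
console.log(firstMatch(/(?<!b).cde./)); // bcdef, because "b" is inside the match, not before it
```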

Now, if we return to your regex - (?<!test)\w+?\.?\w+@gmail\.com here is where each match starts:

                   test@gmail.com
                   ^^  ^
                   ||  |
\w+?         -------|  |
\w+          --------  |
@gmail\.com  -----------

See on Regex101

(yes, it's slightly weird, but \w+? and \w+ both produce matches)

The negative lookbehind is for test and since it doesn't appear before the match, the pattern is satisfied.

You might wonder why something like testfoo@gmail.com still produces a match - it has test and then other letters, right?

                   testfoo@gmail.com
                   ^^     ^
                   ||     |
\w+?         -------|     |
\w+          --------     |
@gmail\.com  --------------

See on Regex101

Same result again. The problem is that \w+ will include all letters in a match, so even if the actual string test appears, it will be in the match, not before it.

To be able to differentiate the two, you have to avoid overlaps between the lookbehind pattern and the actual matching pattern.

You can decide to define the matching pattern differently (?<!test)h\w+?\.?\w+@gmail\.com, so the match has to start with an h. In that case there is no overlap and the matching pattern will not "hide" the lookbehind and make it irrelevant. Thus the pattern will match correctly against harry.potter@gmail.com, hagrid@gmail.com but will not match testhermione@gmail.com:

              testhermione@gmail.com
              ^   ^^^     ^
              |   |||     |
(?<!test)    --   |||     |
h            ------||     |
\w+?         -------|     |
\w+          --------     |
@gmail\.com  --------------

See on Regex101
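
The h-anchored variant can be checked against the three addresses mentioned above. A small sketch, with illustrative variable names:

```javascript
// The match must start with a literal "h" that is not preceded by "test".
const hPattern = /(?<!test)h\w+?\.?\w+@gmail\.com/;

const addresses = [
  "harry.potter@gmail.com",
  "hagrid@gmail.com",
  "testhermione@gmail.com",
];

const outcome = addresses.map((address) => hPattern.test(address));
console.log(outcome); // [ true, true, false ]
```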

Alternatively, you can define a lookbehind that doesn't overlap with the start of the matching pattern. But beware: regexes (like most things with computers) do what you tell them, not what you mean. If we use the regular expression (?<!test-)\w+?\.?\w+@gmail\.com (the negative lookbehind is test- now) and test it against test-hermione@gmail.com, we get a match for ermione@gmail.com:

              test-hermione@gmail.com
              ^     ^^     ^
              |     ||     |
(?<!test-)   --     ||     |
\w+?         --------|     |
\w+          ---------     |
@gmail\.com  ---------------

See on Regex101

The regex says that we don't want anything preceded by test-, so the regex engine obliges: the h is preceded by test-, so the engine rejects that starting position, steps one character forward, and the rest of the string fits the pattern.
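
This behaviour is easy to reproduce. A minimal sketch, assuming lookbehind support in the runtime; the variable names are illustrative:

```javascript
// Lookbehind for "test-" in front of the otherwise unchanged pattern.
const dashPattern = /(?<!test-)\w+?\.?\w+@gmail\.com/;

const dashResult = dashPattern.exec("test-hermione@gmail.com");

// Index 5 ("h") is rejected by the lookbehind; index 6 ("e") is accepted.
console.log(dashResult[0], dashResult.index); // ermione@gmail.com 6
```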

So, bottom line

  • avoid having the match overlap with the lookbehind, or it's not actually a lookbehind any more - it's part of the match.
  • be careful - the regex engine will satisfy the lookbehind but in the most literal way possible with the least effort possible.
VLAZ

In order for this to work properly you need to both:

  • Use a negative lookahead (rather than the lookbehind in your example)
  • Anchor the match (to prevent partial matches. Several anchors are possible, but in your case the best is probably \b, for word boundaries)

This is the result:

\b(?!test)\w+?\.?\w+@gmail\.com

See it live!

Not a real meerkat