As the name suggests, the (?<! sequence is a negative lookbehind. The rest of the pattern matches only if it is not preceded by whatever the lookbehind contains. What counts as "preceded" is determined by where the match starts.
Let's start simple - we define a regex .cde. and try to match it against some input:
First nine letters are abcdefgeh
                        ^   ^
                        |   |
.cde. start -------------   |
.cde. end -------------------
See on Regex101
So now we can see that the match is bcdef and that it is preceded by (among other characters) a. So, if we use that as a negative lookbehind, (?<!a).cde., we will not get a match:
First nine letters are abcdefgeh
                       ^^   ^
                       ||   |
(?<!a) -----------------|   |
.cde. start -------------   |
.cde. end -------------------
See on Regex101
We could match the .cde. pattern, but it's preceded by a, which we don't want.
However, what happens if we define the negative lookbehind differently - as (?<!b).cde.:
First nine letters are abcdefgeh
                        ^   ^
                        |   |
.cde. start -------------   |
.cde. end -------------------
See on Regex101
We get a match for bcdef because there is no b before this match. Therefore, it works fine. Yes, b is the first character of the match, but it doesn't show up before it. And this is the core of lookarounds (lookbehinds and lookaheads) - they are not included in the main match. In fact, they fall under zero-length matches, since they are checked but never appear in the match. In effect, they are evaluated at some position but check a part of the input that will not go into the final match.
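To try the three patterns above outside Regex101, here is a minimal sketch using Python's re module (the choice of Python is an assumption; the original demos are on Regex101). The variable names are only for illustration.

```python
import re

text = "First nine letters are abcdefgeh"

# No lookbehind: the only place .cde. fits is "bcdef".
plain = re.search(r".cde.", text)

# The match would have to start at "b", which is preceded by "a",
# so the negative lookbehind rejects it and nothing else fits.
not_after_a = re.search(r"(?<!a).cde.", text)

# "b" is the first character of the match, not the character before it,
# so this lookbehind is satisfied.
not_after_b = re.search(r"(?<!b).cde.", text)

print(plain.group(), plain.span())              # bcdef (24, 29)
print(not_after_a)                              # None
print(not_after_b.group(), not_after_b.span())  # bcdef (24, 29)
```

Note that the span is identical with and without the (?<!b) lookbehind: the lookbehind is checked, but it contributes no characters to the match.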
Now, if we return to your regex - (?<!test)\w+?\.?\w+@gmail\.com - here is where each part of the match starts:
                test@gmail.com
                ^^  ^
                ||  |
\w+? ------------|  |
\w+ --------------  |
@gmail\.com ---------
See on Regex101
(yes, it's slightly weird, but \w+? and \w+ both produce matches: the lazy \w+? takes only the t and \w+ takes est)
The negative lookbehind is for test and since it doesn't appear before the match, the pattern is satisfied.
You might wonder why something like testfoo@gmail.com still produces a match - it has test and then other letters, right?
                testfoo@gmail.com
                ^^     ^
                ||     |
\w+? ------------|     |
\w+ --------------     |
@gmail\.com ------------
See on Regex101
Same result again. The problem is that \w+ will include all letters in a match, so even if the actual string test appears, it will be in the match, not before it.
To be able to differentiate the two, you have to avoid overlaps between the lookbehind pattern and the actual matching pattern.
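The overlap can be made visible with a short Python sketch. The two capture groups are an addition for illustration only (they are not in your regex); they expose what \w+? and \w+ each consumed.

```python
import re

# Same pattern as in the question, with capture groups added
# purely to show which part of the input each \w piece takes.
pattern = re.compile(r"(?<!test)(\w+?)\.?(\w+)@gmail\.com")

results = {}
for address in ("test@gmail.com", "testfoo@gmail.com"):
    m = pattern.search(address)
    # Store the start offset, the full match and both groups.
    results[address] = (m.start(), m.group(0), m.groups())
    print(address, "->", results[address])
```

In both cases the match starts at offset 0, where nothing precedes it, so the lookbehind has nothing to reject; the letters of test are consumed by the two \w pieces instead.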
You can decide to define the matching pattern differently: (?<!test)h\w+?\.?\w+@gmail\.com, so the match has to start with an h. In that case there is no overlap, and the matching pattern will not "hide" the lookbehind and make it irrelevant. Thus the pattern will match correctly against harry.potter@gmail.com and hagrid@gmail.com but will not match testhermione@gmail.com:
                testhermione@gmail.com
                ^   ^^^     ^
                |   |||     |
(?<!test) -------   |||     |
h -------------------||     |
\w+? -----------------|     |
\w+ -------------------     |
@gmail\.com -----------------
See on Regex101
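The same check as a small Python sketch (Python is an assumption here; the three addresses are the ones mentioned above):

```python
import re

# The match is now forced to start with "h", so "test" can no longer
# be swallowed by \w and the lookbehind actually gets to see it.
pattern = re.compile(r"(?<!test)h\w+?\.?\w+@gmail\.com")

addresses = [
    "harry.potter@gmail.com",
    "hagrid@gmail.com",
    "testhermione@gmail.com",
]

matches = {}
for address in addresses:
    m = pattern.search(address)
    matches[address] = m.group() if m else None
    print(address, "->", matches[address])
```

The first two addresses match in full, while testhermione@gmail.com gives None: its only h is directly preceded by test.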
Alternatively, you can define a lookbehind that doesn't overlap with the start of the matching pattern. But beware. Remember that regexes (like most things with computers) do what you tell them, not exactly what you mean. If we use the regular expression (?<!test-)\w+?\.?\w+@gmail\.com (the negative lookbehind is test- now) and test it against test-hermione@gmail.com, we get a match for ermione@gmail.com:
                test-hermione@gmail.com
                ^     ^^     ^
                |     ||     |
(?<!test-) ------     ||     |
\w+? ------------------|     |
\w+ --------------------     |
@gmail\.com ------------------
See on Regex101
The regex says that we don't want anything preceded by test-, so the regex engine obliges - there is a test- before the h, so the engine discards that starting position, and the rest of the string, starting from the e, still fits the pattern.
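A quick Python sketch of this pitfall, assuming the pattern is your original regex with only the lookbehind changed to test-:

```python
import re

# Assumed pattern: the original regex with the lookbehind set to "test-".
pattern = re.compile(r"(?<!test-)\w+?\.?\w+@gmail\.com")

m = pattern.search("test-hermione@gmail.com")

# The engine rejects the start at "h" (preceded by "test-") and simply
# moves one character to the right, where the lookbehind is satisfied.
print(m.group(), m.start())  # ermione@gmail.com 6
```

Offset 6 is the e right after the h, so the lookbehind cost the match exactly one character and nothing more.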
So, bottom line:
- avoid having the match overlap with the lookbehind, or it's not actually a lookbehind any more - it's part of the match.
- be careful - the regex engine will satisfy the lookbehind, but in the most literal way possible and with the least effort possible.