
I'm new to learning Regular Expressions, and I came across this answer which uses positive lookahead to validate passwords.

The regular expression is - (/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[0-9a-zA-Z]{8,}$/) and the breakdown provided by the user is -

(/^
(?=.*\d)                //should contain at least one digit
(?=.*[a-z])             //should contain at least one lower case
(?=.*[A-Z])             //should contain at least one upper case
[a-zA-Z0-9]{8,}         //should contain at least 8 from the mentioned characters
$/)

However, I'm not very clear on chaining multiple lookaheads together. From what I have learned, a positive lookahead checks if the expression is followed by what is specified in the lookahead. As an example, this answer says -

The regex is(?= all) matches the letters is, but only if they are immediately followed by the letters all

So, my question is how do the individual lookaheads work? If I break it down -

  1. The first part is ^(?=.*\d). Does this mean that, at the start of the string, it looks for zero or more occurrences of any character followed by one digit (thereby checking for the presence of a digit)?
  2. If the first part is correct, does the second part, (?=.*[a-z]), continue from where Step 1 left off and look for zero or more occurrences of any character followed by a lowercase letter? Or are the two lookaheads completely unrelated to each other?
  3. Also, what is the use of the ( ) around every lookahead? Does it create a capturing group?

I have also looked at the Rexegg article on lookaheads, but it didn't help much.

Would appreciate any help.

  • Lookarounds are zero-length assertions. In your case they all match from the beginning of the string. – Toto Nov 05 '17 at 11:29
  • The keyword here is not the lookahead, but ***backtracking***: `(?=.*\d)` first consumes the complete line (`.*`), then *backtracks* until it finds a digit (`\d`). This is repeated for each of the lookaheads (and could even be optimized, but that is a whole other story). – Jan Nov 05 '17 at 11:31

2 Answers


As mentioned in the comments, the key point here is not the lookaheads but backtracking: (?=.*\d) first lets .* consume the complete line, then backtracks until a digit (\d) can be matched.


This is repeated throughout the different lookaheads and could be optimized like so:
(/^
(?=\D*\d)                // should contain at least one digit
(?=[^a-z]*[a-z])         // should contain at least one lower case
(?=[^A-Z]*[A-Z])         // should contain at least one upper case
[a-zA-Z0-9]{8,}          // should contain at least 8 from the mentioned characters
$/)

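Here, the principle of contrast applies: each negated class (\D, [^a-z], [^A-Z]) can only consume characters that are not the one being looked for, so it stops at the first digit, lowercase letter or uppercase letter, and there is nothing to backtrack over.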
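
To check that the two variants accept and reject the same inputs, here is a minimal sketch using Python's re module (an assumption on my part; the question uses JavaScript-style /.../ literals, but for ASCII input these patterns should behave the same way). The sample passwords are hypothetical and chosen only for illustration.

```python
import re

# Original pattern from the question: each lookahead uses a greedy .* and backtracks.
GREEDY = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[0-9a-zA-Z]{8,}$")

# Contrast-based variant: each negated class stops at the first character of interest.
CONTRAST = re.compile(r"^(?=\D*\d)(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])[a-zA-Z0-9]{8,}$")

# Hypothetical sample inputs, not taken from the original post.
samples = ["Passw0rd", "password1", "PASSWORD1", "Passw0r", "Passw0rd!", "abcDEF123"]

# Map each sample to (accepted by GREEDY, accepted by CONTRAST).
results = {s: (bool(GREEDY.search(s)), bool(CONTRAST.search(s))) for s in samples}

for s, (greedy_ok, contrast_ok) in results.items():
    print(f"{s!r}: greedy={greedy_ok} contrast={contrast_ok}")
```

Both patterns should agree on every sample; the difference is only in how much work the engine does inside each lookahead.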

Jan

Assertions are atomic, independent expressions with separate context
from the rest of the regex.

It is best visualized as: They exist between characters.
Yes, there is such a place.

Being independent though, they receive the current search position,
then they start moving through the string trying to match something.
They literally advance their private (local) copy of the search position
to do this.

They return true or false, depending on whether they matched something.
The caller of the assertion maintains its own copy of the search position.
So, when the assertion returns, the caller's search position has not changed.

Thus, you can weave in and out of places without having to worry about
the search position.
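
The unchanged search position can be observed directly. The following is a small sketch, assuming Python's re module and a made-up input string: a lookahead on its own produces an empty match at position 0, chained lookaheads can be reordered freely because each one starts from that same position, and whatever follows them still starts matching at position 0.

```python
import re

s = "abc123XYZ"  # hypothetical input, for illustration only

# A lookahead on its own scans ahead but consumes nothing,
# so the match starts and ends at the same position.
m = re.match(r"(?=.*\d)", s)
print(m.span())

# Chained lookaheads all start from the same position, so their order does not matter.
a = re.match(r"^(?=.*\d)(?=.*[A-Z])", s)
b = re.match(r"^(?=.*[A-Z])(?=.*\d)", s)
print(a.span(), b.span())

# After both assertions pass, the next token still starts at position 0.
c = re.match(r"^(?=.*\d)(?=.*[A-Z])[a-z]+", s)
print(c.group())
```

This is why the lookaheads in the password regex are unrelated to each other: each one is a separate yes/no test run from the start of the string.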

You can see this a little more dramatically, when assertions are nested:

Target1: Boy1 has a dog and a train
Target2: Boy2 has a dog

Regex: Boy\d(?= has a dog(?! and a train))

Objective: Find the Boy# that matches the regex.
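
The nested example can be run as follows; this is a sketch assuming Python's re module, and the helper name find_boy is hypothetical. The outer lookahead requires " has a dog" right after Boy#, and the inner negative lookahead then rejects the case where " and a train" follows.

```python
import re

PATTERN = re.compile(r"Boy\d(?= has a dog(?! and a train))")

def find_boy(target):
    """Return the matched Boy# text, or None if the regex does not match."""
    m = PATTERN.search(target)
    return m.group() if m else None

target1 = "Boy1 has a dog and a train"
target2 = "Boy2 has a dog"

print(find_boy(target1))  # the inner negative lookahead fails, so no match
print(find_boy(target2))  # both assertions pass; only "Boy2" is consumed
```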


Other noteworthy things about assertions:

They are atomic (i.e., independent) in that they are immune to backtracking
from external forces.

Internally, they can backtrack just like anywhere else.
But, when it comes to the position they were given, that cannot change.
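
Both halves of this can be illustrated with a short sketch, assuming Python's re module and toy inputs. Inside the lookahead, .* backtracks until \d can match. From the outside, though, the engine cannot go back into a finished lookahead and ask it for a shorter match, which is why the last pattern below fails.

```python
import re

# Internal backtracking: .* first grabs "a1b", then gives characters back
# until \d can match "1", leaving "a" in the capture group.
inner = re.match(r"(?=(.*)\d)", "a1b")
print(inner.group(1))

# Without an assertion, a+ can give back one "a" so the trailing "a" matches.
plain = re.search(r"(a+)a", "aaa")
print(plain.group(), plain.group(1))

# Inside a lookahead, a+ captures as much as it can; once the lookahead has
# succeeded, the engine cannot re-enter it to try a shorter capture,
# so \1 followed by another "a" never fits.
atomic = re.search(r"(?=(a+))\1a", "aaa")
print(atomic)
```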

Also, inside assertions, it is possible to capture just like anywhere else.
For example, ^(?=.*\b(\w+)\b) captures the last word in the string, but the search position does not change.
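
A quick way to see this, sketched with Python's re module on a made-up sentence: the group holds the last word, while the overall match is empty and ends at position 0.

```python
import re

text = "the quick brown fox"  # hypothetical input, for illustration only

m = re.match(r"^(?=.*\b(\w+)\b)", text)

print(m.group(1))  # the word captured inside the lookahead
print(m.span(1))   # where that word sits in the string
print(m.end())     # end of the overall match: nothing was consumed
```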

Also, assertions are like a red light. The immediate expression that follows the assertion
must wait until it gets the green light.
This is the result the assertion passes back, true or false.
