Lookahead in regular expressions

Question

I was taking the freecodecamp.org course on JavaScript data structures and going through the RegExp chapter when I came across the following assertion:

"The regular expression /(?=\w{3,6})(?=\D*\d)/ will check whether a password contains between 3 and 6 characters and at least one number". (Here "check" meaning that regExp.test(password) returns true)

This seems odd to me. Looking around on Stack Exchange, I found a post stating that A(?=B) is the definition of positive lookahead, and it makes no mention that A (the expression preceding the parenthesis) is optional. So shouldn't freecodecamp's example have an expression before the first lookahead?

Here is another example that I believe is quite similar to the previous one but simpler, so I will mention it in case the explanation is simpler, too:

Why does (?=\w)(?=\d), when tested against the string "1", return true? Shouldn't it look for an alphanumeric character followed by a numeric character?

PS: After some thought, I hypothesized that my first example checks both lookahead patterns independently (i.e. first it checks whether the string is made of three to six characters and returns true, then it checks whether there is an alphanumeric character, and finally, since both checks returned true, the whole regexp test returns true). But this doesn't seem coherent with the definition mentioned in the post I've linked. Is there a more general definition or algorithm that the computer "internally" uses to deal with lookaheads?

> But this doesn't seem to be coherent with the definition mentioned in the post. ... Not clear, which definition? — Dri372, Jun 14 '21 at 14:19
A(?=B): "Looks for the expression A followed by expression B" and its equivalents. — alexp9, Jun 14 '21 at 14:25
That's not even a syntactically valid regular expression because of the extra `)` at the end. — Wyck, Jun 14 '21 at 14:28
Be very careful with your use of the word _matches_ here. A _test_ of a regular expression is true if there are 1 or more _matches_. The _match_ itself is the occurrence of a matching substring, which, in the case of a lookahead-only expression, will be a string of length 0 starting where the span (of 3-to-6 non-whitespace characters and at least 1 digit) begins. The expression in question doesn't match the whole input string, so it's not useful. If it were bracketed with `^` _expr_ `$`, that would cause it to match the whole string. — Wyck, Jun 14 '21 at 14:36
Thanks for pointing that out. I will edit my question so that the terminology is more precise. — alexp9, Jun 14 '21 at 14:52
I don't know if I've misunderstood you, @Wyck, but if you test "1" against /(?=\w)(?=\w)/, for example here https://regex101.com/, it returns true. — alexp9, Jun 14 '21 at 14:57
Yes, it returns true, but the match is a zero-length string. Try it [here](https://regex101.com/r/AGpGn5/1/) and see that it matches a 0-length string before every non-whitespace character. (FYI, `(?=\w)(?=\w)` is redundant because it just makes the same assertion twice -- that there is a non-whitespace character ahead) — Wyck, Jun 14 '21 at 15:09
Think of a positive lookahead as a normal regex except that when you get to the closing paren the scanner backs up to where you were at the start of the lookahead. In this case that is the beginning of the string. As others have pointed out, since everything is in a lookahead the matching string that is captured is empty. It seems like if you reversed your two lookaheads and removed the second lookahead it would still work and you would capture the 3-6 character password too. /(?=\D*\d)\w{3,6}/ — Chris Maurer, Jun 14 '21 at 15:32

score 0 · Accepted Answer · answered Jun 14 '21 at 16:03

Definitions

Lookarounds are similar to word-boundary metacharacters like \b or the anchors ^ and $ in that they don’t match text, but rather match positions within the text.

Positive lookahead peeks forward in the text to see if its subexpression can match, and is successful as a regex component if it can. Positive lookahead is specified with the special sequence (?=...).

Lookarounds do not consume text

An important thing to understand about lookaround constructs is that although they go through the motions to see if their subexpression is able to match, they don’t actually “consume” any text.
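A small sketch of what "not consuming" means in practice, using sample strings that are my own and not from the question: after the lookahead succeeds, the next token starts at the same position, whereas an ordinary token moves past the text it matched.

```javascript
// The lookahead (?=\d) checks that a digit comes next but does not move forward,
// so the following \w matches that very same digit.
const samePosition = /(?=\d)\w/.exec("a1");
console.log(samePosition[0], samePosition.index); // "1" at index 1

// Without the lookahead, \d consumes the digit and \w then needs one more character,
// which "a1" does not have.
const consumed = /\d\w/.test("a1");
console.log(consumed); // false
```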

Examples

1: A(?=B)

Here A is indeed not optional, but it is not part of the lookahead either. As mentioned above, positive lookahead is specified using (?=...), so only B is part of the lookahead.

If you run it against AB, only A will match. The regex does not simply mean "look for A". It means: look for an A that has a B after it, but match only the A.
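This can be checked quickly in JavaScript. The inputs "AB" and "AC" below are just illustrative:

```javascript
// A followed by B: the match succeeds, but only "A" is part of it.
const followed = /A(?=B)/.exec("AB");
console.log(followed[0], followed.index); // "A" at index 0

// A not followed by B: the lookahead fails, so there is no match at all.
const notFollowed = /A(?=B)/.test("AC");
console.log(notFollowed); // false

// For comparison, a plain AB consumes both characters.
const plain = /AB/.exec("AB");
console.log(plain[0]); // "AB"
```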

2: (?=B)

If you think about it, this regex will never actually consume anything. It will find positions that have B after them, but the match itself is empty and never includes B.
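One way to see those positions, as a rough illustration with a made-up input, is to insert a marker at every place where the lookahead succeeds:

```javascript
// The match is an empty string located just before the B.
const emptyMatch = /(?=B)/.exec("AB");
console.log(JSON.stringify(emptyMatch[0]), emptyMatch.index); // "" at index 1

// Replacing each zero-length match with "|" marks the positions without removing any B.
const marked = "ABAB".replace(/(?=B)/g, "|");
console.log(marked); // A|BA|B
```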

3: (?=\w)(?=\d)

This regex does not check whether there is a word character (in regex, a word character is one of a-zA-Z0-9_) followed by a digit. (?=\w) finds a position where the next character is a word character. It consumes nothing, so the regex stays at that position.

However, we also have (?=\d) after (?=\w). Since we are still at the same position, we make another check to see whether that same next character is a digit. It's the same as asking for a position where the next character is both a word character and a digit. That is redundant, since every digit is a word character, so the same can be achieved using just (?=\d).
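The following sketch tries this out, together with the regex from the question. The sample strings are my own, and the anchored variant at the end is only an assumption that combines the suggestions from the comments (reordering the lookahead and adding ^ and $). It is not something the course itself proposes.

```javascript
// Both lookaheads inspect the same next character, so the pair behaves like (?=\d) alone.
const both = /(?=\w)(?=\d)/;
const digitOnly = /(?=\d)/;
const samples = ["1", "a", "a1", "_"];
const agree = samples.every((s) => both.test(s) === digitOnly.test(s));
console.log(both.test("1"), both.test("a"), agree); // true false true

// The regex from the question: both lookaheads are checked from the same starting position.
const course = /(?=\w{3,6})(?=\D*\d)/;
console.log(course.test("abc1"));        // true
console.log(course.test("abcdefghij1")); // true: nothing here limits the total length

// Hypothetical anchored variant: the lookahead requires a digit somewhere,
// then \w{3,6} must consume the whole string.
const anchored = /^(?=\D*\d)\w{3,6}$/;
console.log(anchored.test("abc1"));        // true
console.log(anchored.test("abcdefghij1")); // false: too long
console.log(anchored.test("abcdef"));      // false: no digit
```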
