I am working with regex in both PHP and JavaScript, so I was looking for a good tutorial. In this regex tutorial I found an example of a lookbehind that matches a 3-digit number only if it is preceded by the word "USD". It shows two different cases: one where the lookbehind is placed after the match and one where it is placed before the match.

Here are the regex patterns:

\d{3}(?<=USD\d{3}) //after the match
(?<=USD)\d{3} //before the match

The example string is:

USD100;

I grasped the idea, but I could not figure out what is actually going on inside the regex engine to complete the task. Can anyone explain it in simple terms? Thanks in advance.

Wiktor Stribiżew
AL-zami
  • Does the after-the-match regex work for you? – Avinash Raj Jul 13 '15 at 07:52
  • That tutorial is amazing, and [explains everything](http://www.rexegg.com/regex-disambiguation.html#lookbehind). What part didn't you understand? For `(?<=USD)\d{3}`, *The lookbehind `(?<=USD)` asserts that at the current position in the string, what precedes is the characters "USD". If the assertion succeeds, the engine matches three digits with `\d{3}`.* and for `\d{3}(?<=USD\d{3})`, *`\d{3}` matches `100`, then the lookbehind `(?<=USD\d{3})` asserts that at that position in the string, what immediately precedes is the characters "USD" then three digits.*. – Wiktor Stribiżew Jul 13 '15 at 07:52
  • Lookahead is fine. My problem is with lookbehind, mainly the internal approach to matching. – AL-zami Jul 13 '15 at 07:55
  • The regex engine must have a working procedure for this match. How do both of them give the same result when one is after the match and the other is before it? – AL-zami Jul 13 '15 at 07:59
  • There are 2 approaches to implementing look-behind. Although I haven't looked at the code of PCRE in detail, it's likely that it uses the forward-matching method, where the length of the look-behind pattern is determined by studying the pattern, and the engine then attempts a forward match starting from (current position - pattern_length). The other approach involves matching backwards, which is what is done in .NET: the look-behind pattern is interpreted from right-to-left and matched right-to-left from the current position. – nhahtdh Jul 13 '15 at 08:13

2 Answers

Rexegg explains it as follows:

(?<=USD)\d{3}
The lookbehind (?<=USD) asserts that at the current position in the string, what precedes is the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.

\d{3}(?<=USD\d{3})
\d{3} matches 100, then the lookbehind (?<=USD\d{3}) asserts that at that position in the string, what immediately precedes is the characters "USD" then three digits.

Your question is

Then how do both of them give the same result when one is after the match and the other is before it?

The look-arounds in the two example patterns are not the same. Compare:

\d{3}(?<=USD\d{3})
            ^^^^^
(?<=USD)\d{3}

The first one checks whether USD followed by 3 digits appears right before the current position in the string (which is after the 3 digits).

The second one checks whether just USD appears right before the current position in the string (which is before the 3 digits).
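
A quick way to confirm that both patterns select the same text is to run them against the sample string. This is a minimal sketch, assuming a JavaScript engine with look-behind support (ES2018 or later, for example a recent Node.js); the variable names are illustrative only.

```javascript
// Sample string from the question.
const input = "USD100;";

// Look-behind placed after the digits: checks for "USD" plus the 3 digits just consumed.
const after = /\d{3}(?<=USD\d{3})/.exec(input);

// Look-behind placed before the digits: checks for "USD" only.
const before = /(?<=USD)\d{3}/.exec(input);

console.log(after[0], after.index);   // 100 3
console.log(before[0], before.index); // 100 3

// Neither pattern matches when the prefix is different.
const afterOther = /\d{3}(?<=USD\d{3})/.test("EUR100;");
const beforeOther = /(?<=USD)\d{3}/.test("EUR100;");
console.log(afterOther, beforeOther); // false false
```

Both calls report the same text at the same index because a look-behind consumes no characters; only the point at which the check runs differs.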

Here is a visualization:

[Figure: regular expression visualizations of \d{3}(?<=USD\d{3}) and (?<=USD)\d{3}]

Wiktor Stribiżew
The example below shows how PCRE (and most engines) implement look-behind. Take note of the position of the cursor just before entering the look-behind in each case.

  • In the case of \d{3}(?<=USD\d{3}), note that the cursor advances 3 positions after matching \d{3}, so the look-behind needs to look past the 3 digits that were just consumed in order to check for USD in front of them.

    This method makes sure that the numbers are there first, before checking for the prefix.

    USD100;
    ^
    Attempting to match \d{3}. Fail and bump along.
    
    USD100;
     ^
    Attempting to match \d{3}. Fail and bump along.
    
    USD100;
      ^
    Attempting to match \d{3}. Fail and bump along.
    
    USD100;
       ^
    Attempting to match \d{3}.
    
    USD100;
          ^
    Matched \d{3}. Attempting to assert (?<=USD\d{3}) (length 6).
    
    USD100;
    ^     +
    Save current position. Go back 6 characters.
    (Attempt to match USD\d{3} succeeds, positive look-behind succeeds)
    
    USD100;
          ^
    Back to the saved position and report a match.
    
  • In the case of (?<=USD)\d{3}, note that the cursor is right in front of 100, so it only needs to look back 3 characters to check that USD is there.

    This method makes sure that the prefix exists first, before matching the numbers.

    USD100;
    ^
    Attempting to assert (?<=USD) (length 3). Fail length check and bump along.
    
    USD100;
     ^
    Attempting to assert (?<=USD) (length 3). Fail length check and bump along.
    
    USD100;
      ^
    Attempting to assert (?<=USD) (length 3). Fail length check and bump along.
    
    USD100;
       ^
    Attempting to assert (?<=USD) (length 3).
    
    USD100;
    ^  +
    Save current position. Go back 3 characters.
    (Attempt to match USD succeeds, positive look-behind succeeds)
    
    USD100;
       ^
    Back to the saved position. Attempting to match \d{3}.
    
    USD100;
          ^
    Matched \d{3} and report a match.
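
The forward-matching procedure traced above can be sketched in code. This is only an illustration under stated assumptions, not how PCRE is actually written: lookBehindFixed, matchPrefixFirst and matchDigitsFirst are hypothetical helpers, and the sticky flag (y) of JavaScript's RegExp stands in for "attempt a match exactly at this position".

```javascript
// Hypothetical helper: emulates a fixed-length positive look-behind by
// stepping back `length` characters and matching forward from there.
function lookBehindFixed(input, pos, pattern, length) {
  // Length check: too few characters before the cursor, so fail early.
  if (pos < length) return false;
  // The sticky flag ("y") forces the attempt to start exactly at lastIndex.
  const re = new RegExp(pattern, "y");
  re.lastIndex = pos - length;
  const m = re.exec(input);
  // Succeed only if the match ends at the saved cursor position.
  return m !== null && m.index + m[0].length === pos;
}

// Emulates (?<=USD)\d{3}: assert the prefix first, then match the digits.
function matchPrefixFirst(input) {
  const digits = /\d{3}/y;
  for (let pos = 0; pos <= input.length; pos++) {
    if (!lookBehindFixed(input, pos, "USD", 3)) continue; // bump along
    digits.lastIndex = pos;
    const m = digits.exec(input);
    if (m) return { index: pos, text: m[0] };
  }
  return null;
}

// Emulates \d{3}(?<=USD\d{3}): match the digits first, then look back 6 characters.
function matchDigitsFirst(input) {
  const digits = /\d{3}/y;
  for (let pos = 0; pos <= input.length; pos++) {
    digits.lastIndex = pos;
    const m = digits.exec(input);
    if (!m) continue; // bump along
    const end = pos + m[0].length;
    if (lookBehindFixed(input, end, "USD\\d{3}", 6)) {
      return { index: pos, text: m[0] };
    }
  }
  return null;
}

console.log(matchPrefixFirst("USD100;")); // index 3, text "100"
console.log(matchDigitsFirst("USD100;")); // index 3, text "100"
console.log(matchPrefixFirst("EUR100;")); // null
```

The sketch skips backtracking entirely, which is enough here because every piece of both patterns has a fixed length.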
    

Look-behind has no single standard definition, so different engines implement it differently and place different limitations on what is allowed inside it.

  • .NET implements look-behind by matching the pattern inside look-behind from right-to-left. This makes it possible to put any construct inside look-behind, but since the tokens in the pattern are read from right-to-left, it is confusing when the pattern contains backreferences.

  • Other engines (PCRE included) choose to match the pattern inside look-behind from left-to-right: they study the pattern to determine its length, then perform a match starting from the current position minus that length. Since not all patterns have a bounded length, most of these engines reject patterns whose length is unbounded, which keeps the performance reasonable.
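
The right-to-left strategy can be observed directly in JavaScript, which (from ES2018 on) is specified to match look-behind backwards, much like .NET. The sketch below assumes an engine with look-behind support, such as a recent Node.js; an engine that needs a bounded length, such as PCRE, would be expected to reject the first pattern rather than run it.

```javascript
// Unbounded quantifier inside look-behind: accepted when the engine matches backwards.
const spaced = /(?<=USD\s*)\d{3}/.exec("USD   100;");
console.log(spaced[0], spaced.index); // 100 6

// Right-to-left evaluation shows up in the capture groups: the rightmost
// greedy group is tried first, so it takes as many digits as it can.
const groups = /(?<=(\d+)(\d+))$/.exec("1053");
console.log(groups[1], groups[2]); // 1 053
```

A left-to-right reading of (\d+)(\d+) would suggest that the first group gets "105" and the second gets "3"; the reversed result is the kind of surprise that makes captures and backreferences inside look-behind confusing.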

nhahtdh