72

For example,the regex below will cause failure reporting lookbehind assertion is not fixed length:

#(?<!(?:(?:src)|(?:href))=["\']?)((?:https?|ftp)://[^\s\'"<>()]+)#S

Such kind of restriction doesn't exist for lookahead.

user202729
  • 3,358
  • 3
  • 25
  • 36
wamp
  • 5,789
  • 17
  • 52
  • 82
  • What reference are you using for “lookbehind assertion MUST be fixed length”? – Alex Sep 26 '10 at 03:28
  • 1
    **lookbehind assertion is not fixed length** will cause failure,can't we infer it from that? – wamp Sep 26 '10 at 03:35
  • 1
    What regex engine are you using? Perl? C#? PHP? There are lots of tools out there that handle regexes, and all have their own quirks – Yuliy Sep 26 '10 at 03:37
  • I'm using PCRE. IMO,`lookahead` and `lookbehind` should be similar,why different here? – wamp Sep 26 '10 at 03:38
  • What I'm reading says "lookaround is zero-width" as in both directions not just lookbehind. – Alex Sep 26 '10 at 03:40
  • @AlexW,I never heard of `lookaround` assertion, what's that? – wamp Sep 26 '10 at 03:41
  • http://www.regular-expressions.info/lookaround.html Look at the part regarding Lookaround Is Atomic. – Alex Sep 26 '10 at 03:44
  • 2
    I've read that,lookaround is just a abbreviation for lookahead and lookbehind.But my question is still not answered. – wamp Sep 26 '10 at 03:57

5 Answers5

97

Lookahead and lookbehind aren't nearly as similar as their names imply. The lookahead expression works exactly the same as it would if it were a standalone regex, except it's anchored at the current match position and it doesn't consume what it matches.

Lookbehind is a whole different story. Starting at the current match position, it steps backward through the text one character at a time, attempting to match its expression at each position. In cases where no match is possible, the lookbehind has to go all the way to the beginning of the text (one character at a time, remember) before it gives up. Compare that to the lookahead expression, which gets applied exactly once.

This is a gross oversimplification, of course, and not all flavors work that way, but you get the idea. The way lookbehinds are applied is fundamentally different from (and much, much less efficient than) the way lookaheads are applied. It only makes sense to put a limit on how far back the lookbehind has to look.

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
  • 1
    So why can't you just look for the look-behind first, and then find the rest of the pattern? – Lysol Dec 05 '15 at 03:59
  • 13
    @AidanMueller: Some flavors (including PCRE) support the `\K` construct, which does just that: `foo\Kbar` has to match `foo` first, but pretends the match really started at `bar`. But that only works for *positive* lookbehinds. – Alan Moore Dec 05 '15 at 06:19
  • 2
    @AlanMoore That makes sense. Now one thing I'm not sure on. Does look-behind look for what is immediately preceding? Or does it just mean "somewhere before the pattern"? – Lysol Dec 06 '15 at 05:36
  • @AidanMueller: Immediately preceding. To put it another way, whatever is matched by the lookbehind has to end exactly where the next part of the regex *starts* matching. – Alan Moore Dec 06 '15 at 18:07
  • @AlanMoore I still don't understand why it has to be fixed-length though? Why can't positive look-behinds simply be marked as not part of the actual match? – Lysol Dec 09 '15 at 20:21
  • @AidanMueller: Well, greediness is one problem. Consider the Java regex `(?<=f\w{0,4})\w*r` applied to the string `foobar`. If it worked as you described, the result would be just `r`, but it matches `oobar`. In PCRE you can use `(?<=fooba|foob|foo|fo|f)\w*r` and the result will still be `oobar`, proving that lookbehind behavior trumps ordered alternation as well as greediness. Even in .NET, which places no restrictions at all on lookbehinds, `(?<=f\w*)\w*r` matches `oobar`. – Alan Moore Dec 10 '15 at 06:38
  • 2
    so technically its possible to have variable length lookbehind expression, its only not allowed because it may cause performance problems? – M.kazem Akhgary Jan 20 '18 at 13:22
  • The "fundamentally different" part looks wrong. It should usually be possible to "reverse" the regex and match it against the reversed string, just that the implementation of pcre in particular doesn't do it. – user202729 Feb 22 '21 at 02:34
  • @user202729 Yes, the question was in the context of PCRE. Other flavors have slightly different restrictions, while .NET treats lookahead and lookbehind exactly the same. – Alan Moore Mar 23 '21 at 07:22
13

First of all, this isn't true for all regular expression libraries (like .NET).

For PCRE, the reason appears to be:

The implementation of lookbehind assertions is, for each alternative, to temporarily move the current position back by the fixed width and then try to match.

(at least, according to http://www.autoitscript.com/autoit3/pcrepattern.html).

mbeckish
  • 10,485
  • 5
  • 30
  • 55
  • 1
    Why not use the same algorithm for `lookahead` and `lookbehind`? Isn't the prototype the same? – wamp Sep 26 '10 at 03:42
  • 2
    wamp, then you'd have to *reverse* the regex in the lookbehind and step backwards. Regular expressions usually only work forwards and reversing a particular expression is likely nontrivial. – Joey Jan 27 '12 at 10:16
  • They were able to implement a fix size checker (try `#(?<=fw(*SKIP)(*FAIL)|f)oo#`) while right-to-left capability is sorely lacking. – Dávid Horváth Feb 01 '17 at 22:10
  • I think that technically it would be feasible to reverse many variable-width expressions. It doesn't work for all of them, but it works for a great deal of useful expression, so that it would be a partial solution to the problem. – jlh Oct 04 '18 at 15:13
13

PCRE doesn't support floating lookbehind because it can cause major performance problems. This is because of the lack of right-to-left matching capability: PCRE can start a branch only from a fixed left, but left of a variable-length lookbehind can not be fixed.

Generally, try to branch your lookbehind part to fixed length patterns if possible. For example instead of:

(?<=(src|href)=")etc.

(1) use this:

(?:(?<=src=")|(?<=href="))etc.

(2) Or with \K:

(src|href)="\Ketc.

Note that \K is not a real lookbehind, because it always starts search at the end of previous match (no potential backstep into the previous match).

(3) In some complex lookbehind-only cases you can search with an "inverted" lookahead expression in a reversed string. Not too elegant but it works:

.cte(?="=(ferh|crs))
Dávid Horváth
  • 4,050
  • 1
  • 20
  • 34
5

I had the same issue and fixed it by using (?: subexpression)

Defines a noncapturing group. such as Write(?:Line)? "WriteLine" in "Console.WriteLine()" "Write" in "Console.Write(value)"

I had to change the Regex below which is suppose to catch before , or something in the start of string which was giving me lookbehind assertion is not fixed length.

(?<=,|^)

with this,

(?:(?<=,)|^)
Mehrad
  • 4,093
  • 4
  • 43
  • 61
0
grep -P '(?<=((three)|(one)) )two' <<< "one two three three two one"
grep: lookbehind assertion is not fixed length

grep -P '((?<=(three) )|(?<=(one) ))two' <<< "one two three three two one"
one two three three two one

For processing efficiency PCRE does not support right-to-left matching or recursion. When doing a lookbehind PCRE searches the end of any previous matching string - implementing variable size matches would require recursion and reduce efficiency. See: Look Behind Assertions

Adam D.
  • 159
  • 1
  • 5
  • 2
    Thank you for your answer! Can you please provide some context or additional information, instead of just the commands? That will make the answer more useful for other people looking for information. – roelofs Dec 10 '17 at 23:58