26

Recently I realized (by some embarrassment) that regex lookbehind assertions were not possible in Javascript.

What is the (factual) reason for the absence for this assertion so seemingly common?

I realize there are alternate ways to achieve the same thing perhaps, although Is it the basic semantics at work which forbid the functionality, or what exactly?

It also seems that some regex testing tools out there that generate Javascript code from regex patterns seem to ignore this fact — which strikes me as a bit odd.

Community
  • 1
  • 1
l'L'l
  • 44,951
  • 10
  • 95
  • 146
  • 5
    You have selected PCRE flavor at regex101.com. Select JS, and you'll get an error: https://regex101.com/r/jR9uU5/4. Also, look-behinds are too resource-consuming, I think it is the only reason they are not supported in JS. And the best look-behind workarounds IMO are here: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript – Wiktor Stribiżew May 08 '15 at 08:05
  • JS not supporting lookbehinds is a major PITA - but it's not its only problem; JS is one of the least feature-complete regex flavors overall, unfortunately. – Lucas Trzesniewski May 08 '15 at 08:09
  • This is opinion-based because I think only people from the ECMA could answer that one with the real rationale which led to this decision. :-\ – Lucas Trzesniewski May 08 '15 at 08:17
  • Regex101 does warn you about `lookbehind`.You prabaly didnt select `javascript` – vks May 08 '15 at 08:47
  • @vks: Indeed, which is precisely what stribizhev had pointed out. I'm a bit surprised though that it still generates `Javascript` code when selected as `PCRE`. I seem to completely overlook switching to the correct flavor each I use that site. It's not the only tool that seems to allow you to use the wrong spec and generate (invalid) code. – l'L'l May 08 '15 at 09:02
  • 1
    @LucasTrzesniewski: I really think what you and stribizhev have mentioned is quite legit — it makes a lot of sense. I found what seems to be the last thing mentioned about the implementation from Waldemar [here](https://www.mail-archive.com/es-discuss@mozilla.org/msg26650.html). From the sounds of it nobody ever stepped to help make it happen... and basically why it was never put in. – l'L'l May 08 '15 at 09:06
  • @ l'L'l: Thank you for the status update. I've updated my answer as well. – nils May 08 '15 at 09:56

2 Answers2

34

Today

Lookbehind is now an official part of the ES 2018 specification. Axel Rauschmayer gives a good introduction in his blog post.

History

It looks like at the time, Brendan Eich wasn't aware of its existence (because Netscape was built on an older version of Perl):

This was 1998, Netscape 4 work I did in '97 was based on Perl 4(!), but we proposed to ECMA TC39 TG1 (the JS group -- things were different then, including capitalization) something based on Perl 5. We didn't get everything, and we had to rationalize some obvious quirks.

I don't remember lookbehind (which emerged in Perl 5.005 in July '98) being left out on purpose. Waldemar may recall more, I'd handed him the JS keys inside netscape.com to go do mozilla.org.

If you are game to write a proposal or mini-spec (in the style of ES5 even), let me know. I'll chat with other TC39'ers next week about this.

/be

There have been a bunch of different on the mailing list with attempts to include it, but it still seems to be a pretty complex feature performance-wise, because EcmaScript Regular Expressions are backtracking based and backtracking is needed in lookbehind when working with capturing groups. This can lead to problems such as catastrophic backtracking when used incorrectly.

At some point it was suggested for ES6/Es 2015, but it never made the draft, let alone the specification. In the last post in the discussion, it seems that nobody took up the task of implementing it. If anybody feels called to write an implementation, they can sign up for the ES Discuss list and propose it.

Update May 2015:

In May 2015, Nozomu Katō has proposed an ES7 look-behind implementation.

Update September 2015:

Regex Look-behind was added as a stage 0 proposal.

Update May 2017:

The proposal is now at stage 3. This means that now at least two browsers need to implement it for it to become a part of the next EcmaScript standard. As @martixy mentioned in the comments, Chrome has implemented it behind the JS experimental flag.

Community
  • 1
  • 1
nils
  • 25,734
  • 5
  • 70
  • 79
  • 2
    How is the MSDN link has anything to do with look-behind assertion here? And how can look-behind cause catastrophic backtracking? – nhahtdh May 08 '15 at 08:37
  • MSDN is also a good resource, why not referring it? This reply is really answering the question, I think. A remark that you chose PCRE at regex101.com is not really that helpful. +1. – Wiktor Stribiżew May 08 '15 at 08:44
  • @nhahtdh i guess variable lookbehind can cause catastrophic backtracking similar to lookahead!!! – vks May 08 '15 at 08:45
  • 1
    @vks: Depending on the implementation, variable-length look-behind may not cause backtracking. Catastrophic backtracking is usually cause by other parts of the pattern. – nhahtdh May 08 '15 at 08:47
  • 1
    @stribizhev: Of course, you can reference it, but does it has anything to do with the problem at hand? I can't find it says anything about look-behind. – nhahtdh May 08 '15 at 08:47
  • @nhahtdh: Now, you should admit you are being too picky. There is a term "backtracking", there is a link to it. This is really not a problem *at all*. – Wiktor Stribiżew May 08 '15 at 08:49
  • @nhahtdh: If you can give me an example of variable-length lookbehind without backtracking I'll gladly update my post. – nils May 08 '15 at 08:51
  • @nils: I think you should elaborate on that point, instead of throwing the responsibility to me. The articles don't provide sufficient backing for your statement. – nhahtdh May 08 '15 at 08:52
  • @nhahtdh: Which statement exactly? That lookbehind uses backtracking? While I have been able to find certain scenarios where lookbehind works without backtracking (eg. https://msdn.microsoft.com/en-us/library/e902768h%28v=vs.90%29.aspx), I haven't found an engine that doesn't use it at all. So I would be very grateful if you could provide me with a real-world example. – nils May 08 '15 at 08:59
  • 1
    @nils: The pattern inside lookbehind (or any lookaround) is never backtracked into -- it means that once you exit the lookaround, you don't backtrack inside it again, but rather to before the lookaround. However, during the proceed of verifying a lookaround, apart from the backtracking done inside the pattern, the lookbehind itself may be implemented with backtracking or not. – nhahtdh May 08 '15 at 09:03
  • @nhahtdh Thank you for clarifying that. I've updated my post with a reference to that fact that in ES, backtracking is needed for capturing groups. I hope that suffices. – nils May 08 '15 at 09:06
  • @nhahtdh if you have something like `.*` inside lookahead or variable `lookbehind` then it will backtrack. – vks May 08 '15 at 09:08
  • @nils, thanks - It appears most everything you've mentioned appears to be tenable. Unless Waldemar or Brendan Eich happen to drop by and state anything different then that's all we know really. It's somewhat surprising to find out the reason — it's obviously quite doable the devs are receptive about including it; problem is nobody helped them enough to make it happen. – l'L'l May 08 '15 at 10:21
  • 2
    The good folk from the V8 project have implemented the feature already. It is, since a few days ago, available in the stable Chrome v49 behind the Experimental JavaScript flag. (http://v8project.blogspot.bg/2016/02/regexp-lookbehind-assertions.html) – martixy Mar 12 '16 at 15:26
  • Lookbehind landed in Chrome 62 (https://groups.google.com/forum/#!msg/v8-users/r-SN2yuKTL8/pfwrSuqSBQAJ) – Kevin Feb 28 '18 at 05:19
12

To speak from the conclusion, I think look-behind is not implemented in JavaScript, since no one has any idea how it should behave, and existing implementations show that adding support for look-behind is rather complex.

JavaScript/ECMAScript is different from other languages in the sense that the specification includes an abstract implementation of the regex engine, while most other language only stops short at description of behavior of each pieces of regex syntax, with scant description of how different tokens interacts with each other.

Look-ahead? Easy to implement

The implementation of look-ahead is quite straight-forward. You only need to treat the pattern inside the look-ahead in the same manner as pattern outside look-ahead, and perform a left-to-right match as per usual, except that after the look-ahead succeeds 1) the current position is restored to before entering the look-ahead, and 2) choice points inside look-ahead are discarded after it is matched.

There is no limit to what can be included inside look-ahead, since it is a very simple extension to the existing natural left-to-right matching facilities.

Look-behind? Not so easy

On the other hand, the implementation of look-behind is not as straight forward.

Imagine how you would implement the following look-behind construct:

(?<=fixed-string)
(?<=a|fixed|string)
(?<=t[abc]{1,3})
(?<=(abc){2,6})
(?<=^.*abc.*)
(?<=\G"[^"]+");
(?<=^(.....|.......)+)
\b(\w+)\b(?<!\b\1\b.*\1)

Apart from the basic case (?<=fixed-string), which any look-behind implementation must support, (?<=a|fixed|string) is a much desirable case to support.

Different regex engine has varied level of support for the regex above.

Let us look at how they are implemented in .NET and Java. (This is the two flavors whose look-behind behavior I have studied.)

.NET implementation

In Microsoft .NET implementation, all those regex above are valid, since .NET implements look-behind by using right-to-left mode, with the starting offset at the current position. The look-behind construct doesn't generate any choice point by itself.

However, if you use capturing groups inside the look-behind, it starts to get confusing, since the atoms in the patterns are interpreted from right-to-left, as demonstrated in this post. This is the disadvantage of this method: you would need to wrap your mind to think right-to-left when writing a look-behind.

Java implementation

In contrast, Java regex implementation implements look-behind by reusing the left-to-right matching facilities.

It first analyzes the pattern inside the look-behind for the minimum and maximum length of the pattern. Then, look-behind is implemented by trying to match the pattern inside from left-to-right, starting from (current position - minimum length) to (current position - maximum length).

Is there anything missing? Yes! Since we are matching from left-to-right, we need to make sure that the match ends right at the position before entering the look-behind (current position). In Java, this is implemented by appending a node at the end of the pattern inside look-behind.

This implementation is very inefficient, as there are maximum - minimum + 1 choice points created in the look-behind itself, before we even talk about choice points created by the pattern inside the look-behind.

The look-behind bound check is also inefficient, since it is placed at the end of the pattern, and can't prune choice points that are clearly hopeless (those already far exceeding the current position in the middle of pattern).

Summary

As you can see, adding support for look-behind is not easy:

  • The right-to-left approach seems reasonably efficient. However, it requires additional specification on the right-to-left matching behavior on other existing constructs.
  • The approach to reuse left-to-right matching facilities is complex to specify, and is very inefficient. It also requires the introduction of pattern analysis into the specification, lest the performance is thrown out of the window.

(Note that I have not yet covered the behavior when look-behind is used inside look-ahead, and vice-versa. This should also be taken into consideration when defining semantics for look-behind construct).

These technical hurdles are also mentioned by Waldemar Horwat (who wrote the ES3 regex spec) in the mail cited in nils' answer:

No one has yet submitted a well-defined proposal for lookbehinds on the table. Lookbehinds are difficult to translate into the language used by the spec and get quite fuzzy when the order of evaluation of parts of the regexp matters, which is what happens if capturing parentheses are involved. Where do you start looking for the lookbehind? Shortest first, longest first, or reverse string match? Greedy or not? Backtrack into capturing results?

Community
  • 1
  • 1
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
  • 1
    Thank you for this insight/explanation. So if I understand your answer correctly, it seems as if the entire way `Javascript` handles `regex` is somewhat inefficient and it would be highly unlikely that they would implement lookbehind assertions because of this? – l'L'l May 08 '15 at 12:24