I am implementing a regular expression engine and have run into an interesting gotcha: if you attempt to match the expression /(?=a)*/ against "a", you theoretically have an infinite number of positive zero width lookahead matches at index 0.

My question is: is it even reasonable to quantify zero width matches? Should I let this run infinitely and blame the person who wrote the expression, or should I catch and deny this type of match?

Edit: Or maybe just one single match and ignore the fact that it asked for more?

Edit 2: Currently, my engine sees the zero width match, adds it to the result (zero characters), stays at the same index, and finally goes back to the same zero width expression as many times as possible (which is unbounded when used with *, +, {n,}, etc).
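To make the loop described above concrete, here is a minimal sketch of one common way to guard a quantifier: stop repeating as soon as an iteration succeeds without consuming any input. The names `Matcher`, `lookahead`, `literal`, and `star` are hypothetical and are not taken from my engine; the sketch is greedy and does no backtracking.

```python
from typing import Callable, Optional

# Hypothetical sub-matcher type (an assumption for this sketch): given the
# text and a position, return the new position on success or None on failure.
Matcher = Callable[[str, int], Optional[int]]


def lookahead(ch: str) -> Matcher:
    """Zero width positive lookahead for a single character."""
    def run(text: str, pos: int) -> Optional[int]:
        # Succeeds without advancing the position.
        return pos if text.startswith(ch, pos) else None
    return run


def literal(ch: str) -> Matcher:
    """Ordinary literal that consumes the character it matches."""
    def run(text: str, pos: int) -> Optional[int]:
        return pos + len(ch) if text.startswith(ch, pos) else None
    return run


def star(inner: Matcher) -> Matcher:
    """Greedy `*` (no backtracking) guarded against empty iterations."""
    def run(text: str, pos: int) -> Optional[int]:
        while True:
            nxt = inner(text, pos)
            if nxt is None:
                # The inner expression failed: keep what matched so far.
                return pos
            if nxt == pos:
                # Zero width iteration: repeating it cannot change the
                # result, so accept it once and stop instead of looping.
                return pos
            pos = nxt
    return run
```

With this guard, `star(lookahead("a"))("a", 0)` returns after a single zero width iteration, while `star(literal("a"))` still consumes every `a` it can.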

  • I think one single match. See also http://stackoverflow.com/a/2973495/2908724 (specifically) and http://regular-expressions.mobi/lookaround.html (generally). – bishop Jun 14 '16 at 13:14
  • Neither of those specifically addressed quantified lookarounds. I'm already plenty familiar with the use of them and the lookahead was just an example. There are other possible zero width matches such as or-ing an empty expression or a question mark. – Richard Robertson Jun 14 '16 at 13:18
  • I think you shouldn't allow quantifiable lookarounds - either one or none. IIRC PERL-Regexes don't allow this either. If you however have to implement them, you could internally discard them, if the quantifier allows 0 times, otherwise check them once. – Sebastian Proske Jun 14 '16 at 13:27
  • Copying existing systems seems to be the best way to approach this. I was unable to get any other systems to infinitely match zero width. – Richard Robertson Jun 14 '16 at 13:30

1 Answer

The consensus in the comments is that no, it is not reasonable to allow more than one zero width match at the same position.
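As a quick check against one existing system, the sketch below runs Python's built-in `re` module on the other zero width cases mentioned in the comments (or-ing an empty expression, and a question mark under a star). It only shows that these patterns terminate in that engine; it makes no claim about how other engines behave. The helper name `span_of` is hypothetical.

```python
import re


def span_of(pattern: str, text: str) -> tuple:
    """Return the (start, end) span of a match anchored at index 0."""
    m = re.match(pattern, text)
    assert m is not None  # every pattern below can match the empty string
    return m.span()


# Alternation with an empty branch under `*`: consumes the "a", then stops
# rather than repeating the empty branch forever.
alt_on_a = span_of(r"(?:a|)*", "a")

# Same pattern where only the empty branch can match.
alt_on_b = span_of(r"(?:a|)*", "b")

# Optional item under `*`.
opt_on_aa = span_of(r"(a?)*", "aa")
```

All three calls return immediately, which is consistent with the engine refusing to repeat an iteration that consumed nothing.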