8

In trying to elaborate an answer to this question, I am now trying to come to terms with the behavior/meaning of Zero-Length regular expressions.

I often use www.regexr.com as a playground to test/debug/understand what's going on in regular expressions.

So we have this most banal scenario:

The regex is a*

The input string is dgwawa (As a matter of fact, the string here is irrelevant)

Why this behavior of reporting that this regex will match infinitely, since it matches zero occurrences of the preceding character ?

Why can't the result be 6 matches, one for each character position (since at every character, regardless of whether it is an a or not, there is a match, since zero matches is a match)?

How does it get into matching infinitely ? So it does not check/progress a character at a time?

I wonder how/where does it get itself into an infinite loop.

enter image description here

Community
  • 1
  • 1
Veverke
  • 9,208
  • 4
  • 51
  • 95

2 Answers2

19

You selected JavaScript regex flavor at regexr.com online regex tester. JavaScript regex engine does not move the index automatically when a pattern that can match an empty string is passed.

That is why when you need to emulate the behavior observed in .NET Regex.Matches, PHP preg_match_all, Python re.finditer, etc. you need to manually advance the index to test each position.

See regex101.com test:

var re = /a*/g; 
var str = 'dgwawa';
var m;
 
while ((m = re.exec(str)) !== null) {
    if (m.index === re.lastIndex) {   // <- this part
        re.lastIndex++;               // <- here
    }                                 // <- is important
    document.body.innerHTML += "'" + m[0] + "'<br/>";
}

If you remove that if block, you will get an infinite loop.

There are two very important things to mention with this regard:

  • Always use appropriate online regex tester for your programming language
  • Avoid using unanchored patterns that can match empty strings
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Nice. I was told before regexr follows a specific regex flavor. I should take this into account more seriously. – Veverke Dec 28 '15 at 15:03
  • 1
    See [*Online sandboxes (for testing and publishing regexes online)*](http://stackoverflow.com/tags/regex/info) section to select the one that you need. – Wiktor Stribiżew Dec 28 '15 at 15:06
  • Thanks for the complete answer ans insights! Excellent job. – Veverke Dec 28 '15 at 15:06
  • Just for the record, there is only one sandbox for .NET regex testing, and, compared to regexr, it is incomparably worse. So I rather stick using regexr with the highest awareness in mind :) – Veverke Dec 29 '15 at 14:51
  • 1
    For testing .NET apps, you can use [regexhero.net](http://regexhero.net/tester) and [regexstorm.net](http://regexstorm.net/tester). And if you need a functionality like regex101.com for .NET, use a very nice free app called [Expresso](http://www.ultrapico.com/expresso.htm). – Wiktor Stribiżew Dec 29 '15 at 16:27
  • This fix does't quite emulate other regex flavours e.g. `/\b|a/g` would match `a` in the string `a b c` in PHP, but with this fix it skips it – Drenai Jul 24 '22 at 17:21
1

There are actually 7 matches

Let me enumerate them, first number is the start (0 based), second number is the length

Match 1:             0       0   
Match 2:             1       0   
Match 3:             2       0   
Match 4:    a        3       1   
Match 5:             4       0   
Match 6:    a        5       1   
Match 7:             6       0   

I use regex101 and it does what most of us expect from this simple regex (given there are regex dialects).

https://regex101.com/r/mN4jA4/1

buckley
  • 13,690
  • 3
  • 53
  • 61
  • I would be happy if the site reported 7, my problem is that is reports "infinite" matches ?! I would think like you, but I am afraid the mechanism does not work like that. – Veverke Dec 28 '15 at 14:58