How to Find All Matches in Regular Expressions when one Overlaps OR Contains the Other?

Question

The question of how to find every match when they might overlap was asked in Overlapping matches in Regex. However, as far as I can see, the answers there does not cover a more general case.

How can we find all substrings that begin with "a" and end with "z"? For example, given "akzzaz", it should find "akz", "akzz", "az" and "akzzaz".

Since there may be more than one match starting at the same position, ("akz" and "akzz") and also there may be more than one match ending at the same position ("az" and "akzzaz") I cannot see how using a lookahead or lookbehind helps as in the mentioned link. (Also, please bear in mind that in the general case "a" and "z" might be more complex regular expressions)

I use C#, so, in case it matters, having any feature specific to .Net Regular Expressions is OK.

In general k would stand for any non-a and non-z character but I don't think it would make a difference in the answer, so, yes, we can assume that. — Ali Ferhat, Feb 04 '12 at 11:48

score 1 · Answer 1 · answered Feb 04 '12 at 11:51

1

For your current problem, string.startwith and string.endwith would do be a better job. Regular Expression is not necessarily faster in all cases.

answered Feb 04 '12 at 11:51

user1058272

113
2
9

For the specific question, you may have a point. However I'm trying to understand regular expressions and their limitations. In the general case we might have wanted to find all such substrings that start with `a`, end with `z`, with the characters in between having a feature expessible by regular expressions. – Ali Ferhat Feb 04 '12 at 12:17

score 1 · Accepted Answer · answered Feb 04 '12 at 12:55

Regular expressions are designed to find one match at a time. Even a global match operation is simply repeated applications of the same regex, each starting at the end of the previous match in the target string. So no, regexes are not able to find all matches in this way.

I will stick my neck out and say that I don't believe you can even find "all strings beginning with 'a' in 'akzzaz'" with a regex. /(a.*)/g will find the entire string, while /(a.*?)/g will find just 'a' twice.

The way I would code this would be to locate all 'a's, and search each of the substrings from there to the end of the string for all 'z's. So search 'akzzaz` and 'az' for 'z', giving 'akz', 'akzz', 'akzzaz', and 'az'. That is a fairly simple thing to do, but not a job for a regex unless the actual 'a' and 'z' tokens are complex.

Yes, that (it is impossible) is my gut feeling too. But my gut feelings have an emberassing rate of being proven wrong. So let's wait and see if someone can come up with a more definite answer. — Ali Ferhat, Feb 04 '12 at 13:05
Mine is a definite answer! I am sure what you want to do is impossible because applying a regex produces a *single* match. Applying the same regex to the same string again will produce the same match. The only thing you can change is *where* in the string the search starts, so a regex cannot find more than one match starting at the same place. Is that clearer? — Borodin, Feb 04 '12 at 13:11

score 0 · Answer 3 · answered Feb 20 '13 at 19:12

I think it's worth noting that there is actually a way for a regex to return more than one match at the same time. Although this doesn't answer your question, I think this would be a good place to mention this for others who may run into a similar situation. The regex below for example would return all the right substrings of a string with a single match and has them in different capturing groups:

(?=(\w+)).

This regex uses capturing groups inside a zero-width assertion and for each match at position i(each character) the capturing group is a substring of length n-i.
Doing anything that would require the regex engine to stay in the same place after a match is probably overkill for a regular expression approach.

score 0 · Answer 4 · answered Feb 04 '12 at 11:51

0

Try this regular expression

a[akz]+z - in case a, k and z are the only characters
a[a-z]+z - in case of any alphabet

answered Feb 04 '12 at 11:51

Sunil Kumar B M

2,735
1
24
31

Nope. It only finds "az" and "akzzaz" (with changing `+` to a `*`) – Ali Ferhat Feb 04 '12 at 12:29

How to Find All Matches in Regular Expressions when one Overlaps OR Contains the Other?

4 Answers4