2

Is it possible to find overlapping matches using regular expressions when searching again for the same pattern? I want to be able to find matches that occurs three times. For example babab occurs three times in babababab:

babababab

babababab

babababab

This is my current Python implementation:

import re
matches = re.findall(r'(?=(\w+).*\1).*\1', "babababab")
print(matches)

My program find only baba instead of babab. Thanks!

Kali User
  • 105
  • 6
  • Does this answer your question? [How to find overlapping matches with a regexp?](https://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp) – JvdV Feb 19 '20 at 08:05
  • Thank you for the link JvdV, but it doesn't answer the case where I need to use back references like in the question I posted – Kali User Feb 19 '20 at 08:19

2 Answers2

2

One trick you may use here is to actually just match on ba(?=bab), which would only consume ba, allowing the regex engine to shift forward logically by just one match:

matches = re.findall(r'ba(?=bab)', "babababab")
matches = [i + 'bab' for i in matches]
print(matches)

This prints:

['babab', 'babab', 'babab']

Note that I concatenate the tail bab to each match, which is fine, because we know the actual logic match was babab.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • That's a pretty nice trick. Does it work when I don't know the inputs beforehand? Lets' say for example finding the longest substring that is at the same time a prefix and a suffix? – Kali User Feb 19 '20 at 06:08
  • Obviously my answer assumes that have some knowledge on how the overlapping works out. For a more general solution, you could just write a string parser which walks down the string checking for matches, one character step at a time. For your exact question, I happen to prefer the above regex solution, as it only requires a one-liner. – Tim Biegeleisen Feb 19 '20 at 06:09
  • I was hopping to implement everything using only the native regex engine because of performance constraints I have, but it seems like some extra computation with Python code itself is required to complement what the re module offers. Thank you! – Kali User Feb 19 '20 at 06:20
  • Nice @TimBiegeleisen, upvoted. Another way would be the capture group inside a positive lookahead – JvdV Feb 19 '20 at 07:33
1

We can generalize the solution to any regex.

Let's say we have a valid regex pattern which you want to search for overlapping matches.

In order to get overlapping matches, we need to avoid consuming characters in each match, relying on the bump-along mechanism to evaluate the regex on every position of the string. This can be achieved by surrounding the whole regex in a look-ahead (?=<pattern>), and we can nest a capturing group to capture the match (?=(<pattern>)).

This technique works for Python re engine since after it found an empty match, it will simply bump-along and will not re-evaluate the regex at the same position but looking for non-empty match on the second try like PCRE engine.

Sample code:

import re

inp = '10.5.20.52.48.10'
matches = [m[0] if type(m) is tuple else m for m in re.findall(r'(?=(\d+(\.\d+){2}))', inp)]

Output:

['10.5.20', '0.5.20', '5.20.52', '20.52.48', '0.52.48', '52.48.10', '2.48.10']

If the original pattern doesn't have numbered backreferences then we can build the overlapping version of the regex with string concatenation.

However, if it does, the regex will need to be modified manually to correct the backreferences which have been shifted by the additional capturing group.

Do note that this method doesn't give you overlapping matches starting at the same index (e.g. looking for a+ in aaa will give you 3 matches instead of 6 matches). It's not possible to implement overlapping match starting at the same index in most regex flavors/library, except for Perl.

nhahtdh
  • 55,989
  • 15
  • 126
  • 162
  • I posted a solution stating exactly `(?=())`. Only to realize later you have covered this perfectly. Upvoted =) – JvdV Feb 19 '20 at 08:04
  • Thank you guys. I am trying to use a case with numbered backreferences as in my original question, but it doesn't capture the overlapping character. My pattern is `(\w+).*\1`, which means a repeated word: `[m[0] if type(m) is tuple else m for m in re.findall('(?=((\w+).*\1))', "babababab")]` returns an empty list – Kali User Feb 19 '20 at 08:17
  • @KaliUser As mentioned, numbered backreferences will have to be modified since the capturing group number has changed. In your case `(?=((\w+).*\2))` – nhahtdh Feb 19 '20 at 08:45
  • @nhahtdh thanks, I also tried that but I keep getting the empty result `[m[0] if type(m) is tuple else m for m in re.findall('(?=((\w+).*\2))', "babababab")]` – Kali User Feb 19 '20 at 08:53
  • @KaliUser You need to use raw string `r'(?=((\w+).*\2))'` since `'\2'` is escape sequence for `'\x02'` – nhahtdh Feb 19 '20 at 08:56
  • @nhahtdh I was not able to make it work for other inputs like `aabaaabaaaaab`. The reason is that the answer sometimes is in `m[0]` and another times it is in `m[1]`. I posted a related follow up question at https://stackoverflow.com/questions/60299360/fast-way-of-finding-the-longest-substring-using-only-regex – Kali User Feb 19 '20 at 11:34