
I answered a question the other day about finding the strings that occur between two specified characters. I ended up with this fairly basic regular expression:

>>> import re
>>> def smallest_between_two(a, b, text):
...     return re.findall(re.escape(a) + "(.*?)" + re.escape(b), text)
...
>>> smallest_between_two(' ', '(', 'def test()')
['test']
>>> smallest_between_two('[', ']', '[this one][this one too]')
['this one', 'this one too']
>>> smallest_between_two('paste ', '/', '@paste "game_01/01"')
['"game_01']

However, when I went to look over it again, I realized that there was a common error that could occur when a match was partially contained inside of another match. Here is an example:

>>> smallest_between_two(' ', '(', 'here is an example()')
['is an example']

I am unsure why it does not also find 'an example' and 'example', since both of those also occur between a ' ' and a '('.

I would rather not do this to find additional matches:

>>> first_iteration = smallest_between_two(' ', '(', 'here is an example()')
>>> smallest_between_two(' ', '(', first_iteration[0] + '(')
['an example']
user3483203

2 Answers


I'll explain why your pattern behaves that way. For overlapping matches, see the answer already provided by cᴏʟᴅsᴘᴇᴇᴅ, which uses the regex module's findall function with the overlapped=True keyword argument.


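Your pattern matches that way because the leading space in the pattern matches the first space in the input, and the non-greedy quantifier .*? then matches as little as possible between that space and the next (. So it is operating correctly: re.findall returns non-overlapping matches, and once the first match has consumed the text up to and including (, the later spaces are no longer available as starting points. To see this more clearly, try the input string here is an example()another example().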
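As a small illustration of the point above, the sketch below runs the same pattern over the suggested input with re.finditer, which exposes the span of each match. The variable names are only for this example.

```python
import re

text = 'here is an example()another example()'
pattern = re.compile(r' (.*?)\(')

# Each entry is ((start, end), captured_text) for one non-overlapping match.
matches = [(m.span(), m.group(1)) for m in pattern.finditer(text)]

for span, captured in matches:
    print(span, repr(captured))
# (4, 19) 'is an example'
# (27, 36) 'example'
```

The first match begins at the first space and consumes everything up to the first (, so the scan resumes after it and the spaces before 'an' and 'example' are never tried as starting points.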

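Now, to get the shortest match in this case, you can use a zero-width negative lookahead to make sure that no space follows the one being matched. Note that (?!.* ) rejects any space later on the same line, not only one before the next (, so this keeps only the match that starts at the last space: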

 (?!.* )(.*?)\(

So:

In [81]: re.findall(r' (?!.* )(.*?)\(', 'here is an example()')
Out[81]: ['example']
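If the input contains several calls, one possible alternative is to forbid the opening character inside the captured group rather than anywhere later in the string. The helper below is a hypothetical sketch (shortest_between is not from the original answers) and it assumes the opening delimiter a is a single character.

```python
import re

def shortest_between(a, b, text):
    """Return the shortest strings between a and b.

    Assumption: a is a single character, so it can be excluded
    from the capture with a negated character class.
    """
    opener = re.escape(a)
    pattern = opener + '([^' + opener + ']*?)' + re.escape(b)
    return re.findall(pattern, text)

print(shortest_between(' ', '(', 'here is an example()'))
# ['example']
print(shortest_between(' ', '(', 'here is an example()another example()'))
# ['example', 'example']

# For comparison, the lookahead version keeps only the match after the last space.
print(re.findall(r' (?!.* )(.*?)\(', 'here is an example()another example()'))
# ['example']
```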
heemayl

You're looking for overlapping regex matching. Simply put, this is not easy to do with Python's standard re module, whose findall only returns non-overlapping matches.

You can, however, use the regex module (pip install it first). Call regex.findall and set overlapped=True.

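import re      # needed for re.escape below
import regex   # third-party module: pip install regex

a, b = ' ', '('
text = 'here is an example()'

# overlapped=True lets each new match start inside the previous one
print(regex.findall('{}(.*?){}'.format(*map(re.escape, (a, b))), text, overlapped=True))
# ['is an example', 'an example', 'example']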
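If installing a third-party module is not an option, a similar result can usually be had from the standard re module by wrapping the whole pattern in a zero-width lookahead, so that no text is consumed and every starting position is tried. This is a sketch of that swapped-in technique, not the regex module's overlapped mode; all_between is a hypothetical helper name.

```python
import re

def all_between(a, b, text):
    """Return every shortest a...b capture, including overlapping ones."""
    # The lookahead matches zero characters, so the scan advances one
    # position at a time and group 1 is captured at each a it finds.
    pattern = '(?=' + re.escape(a) + '(.*?)' + re.escape(b) + ')'
    return re.findall(pattern, text)

print(all_between(' ', '(', 'here is an example()'))
# ['is an example', 'an example', 'example']
```

As with the original function, . does not match a newline here, so captures will not span lines unless re.DOTALL is passed.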
cs95