4

I am using Python 3.6.

My goal is to match a regex which may match multiple strings, overlapping and/or starting from the same position, for example:

re.findall('B.*A','BADACBA')

which gives:

['BADACBA']

But I want:

['BADACBA','BADA','BA','BA']

(the second 'BA' is because there are two instances of 'BA' in the string)

On the suggestion of How to find overlapping matches with a regexp?, I've tried encasing it in a lookahead:

re.findall('(?=(B.*A))','BADACBA')

which gives:

['BADACBA', 'BA']

which is better, but still not what I want.

I also tried the regex module:

regex.findall('B.*A','BADACBA',overlapped=True)

but it still returns:

['BADACBA', 'BA']

I haven't been able to find something that will find all matches. Since I have many such regexes, a hard-coded solution won't help much. Is there a module/function that does this?

Thanks!

boboquack
  • 249
  • 5
  • 15
  • Your pattern `B.*A` is greedy so of course it won't capture BADA when it has the chance to consume the whole string. What you're asking is not logical under the regex paradigm - you're better off writing your own engine for finding all possible matches between two substrings. – zwer Jun 20 '17 at 00:27
  • @zwer Even so, if I do `B.*?A` it still matches only the two `BA`s. – boboquack Jun 20 '17 at 00:31
  • Yes, because then it's not greedy so it won't go after BADA when it can consume just BA... As I said, what you are trying to achieve is anthithetical to the regex philosophy. Sure, you can write specific patterns that will capture specific strings, but regex is not really suitable as a universal solution in this case. – zwer Jun 20 '17 at 00:34

1 Answers1

3

As I've said above, regex is a primarily linear and single-rule-only kind of engine - you can choose between greedy capture or not, but you cannot select both. Also, most regex engines do not support overlapping matches (and even those who support it kind of fake it with substrings / forced head move) because it also doesn't fit regex philosophy.

If you're looking only for simple overlapping matches between two substrings, you can implement it yourself:

def find_substrings(data, start, end):
    result = []
    s_len = len(start)  # a shortcut for `start` length
    e_len = len(end)  # a shortcut for `end` length
    current_pos = data.find(start)  # find the first occurrence of `start`
    while current_pos != -1:  # loop while we can find `start` in our data
        # find the first occurrence of `end` after the current occurrence of `start`
        end_pos = data.find(end, current_pos + s_len)
        while end_pos != -1:  # loop while we can find `end` after the current `start`
            end_pos += e_len  # just so we include the selected substring
            result.append(data[current_pos:end_pos])  # add the current substring
            end_pos = data.find(end, end_pos)  # find the next `end` after the curr. `start`
        current_pos = data.find(start, current_pos + s_len)  # find the next `start`
    return result

Which will yield:

substrings = find_substrings("BADACBA", "B", "A")
# ['BA', 'BADA', 'BADACBA', 'BA']

But you'll have to modify it for more complex matches.

zwer
  • 24,943
  • 3
  • 48
  • 66
  • Thanks, unfortunately I was hoping to match strings like `^.{0,6}A.{6}A.{6}A.{6}[^A]`. Oh well, unless something else comes I will have to just go back to using non-regex commands. – boboquack Jun 20 '17 at 00:56
  • @boboquack - Sadly, you won't be able to do that with regex. Maybe you can create a hybrid of the above where you use regex to find the strings but then move it to the found index + 1 and so on... But it still won't solve the greedy vs non-greedy matches. – zwer Jun 20 '17 at 01:01