4

I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounding by any number of bs. This is the regex I wrote:

import re

pattern = """(?=            # inside lookahead for overlapping results
             (?:a|^)        # match at beginning of str or after a
             (b* (?:a) b*)  # one a between any number of bs
             (?:a|$))       # at end of str or before next a
          """
a_between_bs = re.compile(pattern, re.VERBOSE)

It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:

a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']

I don't understand what is happening. If I change the order of how a potential match could start, the results also change:

pattern = """(?=
             (?:^|a)        # a and ^ swapped
             (b* (?:a) b*)
             (?:a|$))
          """
a_between_bs = re.compile(pattern, re.VERBOSE)

a_between_bs.findall("abbabb")
# ['abb']

I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?

Edit:

I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphtongue, preceded and followed by any number of consonants. This is my regular expression to extract them:

vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'

pattern = f"""(?=
          (?:[{vowels}]|^|{diphtongues})
          ([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
          (?:[{vowels}]|$|{diphtongues})
          )
          """
syllables = re.compile(pattern, re.VERBOSE)

The tricky bit is that the diphtongues end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried to instead replace that group by a positive lookahead (?<=[{vowels}]|^|{diphtongues}), but regex won't accept different lengths (even removing the diphtongues doesn't work, apparently ^ is of a different length).

So this is the problematic case with the pattern above:

syllables.findall('æbə')
# ['bə'] 
# should be: ['æb', 'bə']

Edit 2: I've switched to using regex, which allows variable-width lookbehinds, which solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:

  • Are the results when `a` and `^` are swapped correct? – Scott Hunter May 29 '19 at 19:49
  • I don't understand how the first `findall()` is working. It's not supposed to return overlapping matches. – Barmar May 29 '19 at 20:13
  • When processing alternatives, some regexp engines will try to match the longest alternative, some will prefer the first matching alternative. In the second case, order matters, and that's obviously how Python works. – Barmar May 29 '19 at 20:16
  • @ScottHunter I reproduced it. – Barmar May 29 '19 at 20:17
  • @Barmar The matching is done inside a lookahead. So technically the match is the (empty) string _before_ the b*ab* sequence, so nothing is consumed and I can get overlapping matches. ScottHunter, the results are incorrect in both cases. The result should be "abb" and "bbabb", but only one of them is returned. – Christian Adam May 30 '19 at 08:41
  • I saw the lookahead, but I thought it just for the `a|^`, I didn't notice that it wrapped around the entire thing. Should have realized that was wrong, since it should have been a lookbehind to do what I thought it was doing. – Barmar May 30 '19 at 15:20

2 Answers2

1

I suggest fixing this with a double negation:

(?=         # inside lookahead for overlapping results
 (?<![^a])  # match at beginning of str or after a
 (b*ab*)    # one a between any number of bs
 (?![^a])   # at end of str or before next a
)

See the regex demo

Note I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the first is very important here.

The (?:a|^) at the beginning of the outer lookahead pattern matches a or start of the string, whatever comes first. If a is at the start, it is matched and when the input is abbabb, you get bbabb since it matches the capturing group pattern and there is an end of string position right after. The next iteration starts after the first a, and cannot find any match since the only a left in the string has no a after bs.

Note that order of alternative matters. If you change to (?:^|a), the match starts at the start of the string, b* matches empty string, ab* grabs the first abb in abbabb, and since there is a right after, you get abb as a match. There is no way to match anything after the first a.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • I think I get why replacing the first grouping with a lookbehind helps, but it doesn't match how I understand your explanation? With `(?:a|^)`, the first `a` is matched. In the remaining string, there is no `b*ab*` that is _preceded_ by an `a` or `^`, so no more results are returned. For `(?:^|a)`, I don't understand what's happening. It matches `^` and grabs `abb`. Now there's `abbabb` left, so `a` matches the first group `bbabb` the captured group and `$` matches the end of the string. Why is this not working? – Christian Adam May 30 '19 at 09:20
  • The double negative unfortunately does not work for my actual problem (I edited the post above to include it). Sorry about the confusion, I thought a minimal example would be helpful, but it wasn't complex enough. – Christian Adam May 30 '19 at 10:46
  • I added an example where my solution fails. Any ideas? – Christian Adam May 31 '19 at 14:50
  • @ChristianAdam *Now there's `abbabb` left* - you are wrong here. Once the `abb` match is found, the regex engine advances its index to the position after the first character before trying the pattern again, so there is `bbabb` left, and since there is one `a` there can be no second match. If I knew you have multicharacter strings, not just `a`, I would also recommend PyPi regex module, the double negation is a trick to overcome the fact you may only use fixed width patterns in Python `re` lookbehinds. – Wiktor Stribiżew May 31 '19 at 20:28
  • Hm. Ok, so after the first match (starting with the first `a`), the engine advances to the next chaaracter, `b` in `bbabb`. This matches the `b*ab*` pattern. Why does the lookbehind not find the preceding `a`? Does it only operate over the remaining string? – Christian Adam Jun 01 '19 at 12:49
  • @ChristianAdam The *lookbehind* [finds](https://regex101.com/r/FO8FOr/1) `bbabb`. It is your non-capturing group that does not find it. – Wiktor Stribiżew Jun 01 '19 at 18:49
0

Remember that python "short-circuits", so, if it matches "^", its not going to continue looking to see if it matches "a" too. This will "consume" the matching character, so in cases where it matches "a", "a" is consumed and not available for the next group to match, and because using the (?:) syntax is non-capturing, that "a" is "lost", and not available to be captured by the next grouping (b*(?:a)b*), whereas when "^" is consumed by the first grouping, that first "a" would match in the second grouping.

  • 1
    In that case, why don't I get the "bbabb" match in the second case, where the order is swapped and the empty line beginning is matched? – Christian Adam May 30 '19 at 08:44