
I have a character string 'aabaacaba'. Starting from the left, I am trying to get all substrings of length >= 2 that appear again later in the string. For instance, aa appears again in the string, and so does ab.

I wrote the following regex:

re.findall(r'([a-z]{2,})(?:[a-z]*)(?:\1)', 'aabaacaba')

and I get ['aa'] as the answer. The regular expression misses the ab pattern. I think this is because of overlapping characters. Please suggest how the expression could be fixed. Thank you.

pylang
Sumit

1 Answer


You can use a look-ahead assertion, which does not consume the matched string:

>>> re.findall(r'(?=([a-z]{2,})(?=.*\1))', 'aabaacaba')
['aa', 'aba', 'ba']

NOTE: aba is matched instead of ab, because the quantifier is greedy and tries to match as much as possible.
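If the shorter ab is also wanted, the regex alone will not report it, since it yields at most one (longest) match per starting position. A minimal sketch of one way around that, assuming plain substring enumeration is acceptable; the helper name `repeated_substrings` is hypothetical and not part of `re`:

```python
import re

text = 'aabaacaba'

# Look-ahead pattern from the answer: at most one (longest) match per start position.
longest_per_start = re.findall(r'(?=([a-z]{2,})(?=.*\1))', text)
print(longest_per_start)  # ['aa', 'aba', 'ba']


def repeated_substrings(s, min_len=2):
    """Hypothetical helper: every substring of length >= min_len that
    occurs again after its own end, in order of first appearance."""
    found = []
    for start in range(len(s)):
        for end in range(start + min_len, len(s) + 1):
            candidate = s[start:end]
            # Like .*\1 in the regex, only look in the text after this occurrence.
            if candidate in s[end:] and candidate not in found:
                found.append(candidate)
    return found


print(repeated_substrings(text))  # ['aa', 'ab', 'aba', 'ba']
```

This checks every substring, so it is only a reasonable choice for short inputs.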

falsetru
  • Can `[a-z]` be replaced with `\w` as `(?=(\w{2,})(?=.*\1))` (?) – Sebastián Palma May 14 '17 at 03:03
  • @SebastiánPalma, Yes, it can. But it will also match digits and `_`. I'm not sure whether that's what the OP wants, so I left it as the OP wrote it. Maybe `.` is more appropriate if the OP wants any character. – falsetru May 14 '17 at 03:05
  • You're totally right, I think he needs just a-z characters, it's similar to [this](http://stackoverflow.com/questions/43954326/how-to-replace-a-character-by-regex-in-ruby) question. – Sebastián Palma May 14 '17 at 03:07
  • @SebastiánPalma, I couldn't use a look-behind assertion, because Python's `re` allows only fixed-length look-behind assertions. – falsetru May 14 '17 at 03:08
  • Thank you! This is perfect. Didn't know that there is no restriction on look-behind assertion. – Sumit May 14 '17 at 03:09
  • @Sumit, You're welcome. BTW, you mean `s/no restriction/a restriction/`, right? – falsetru May 14 '17 at 03:11
  • Sorry, I meant that there is no restriction on look-ahead assertions. Also, could you please explain why there is a `?=` in both brackets? I would have used it only in the second bracket. – Sumit May 14 '17 at 03:14
  • @Sumit, Without the first look-ahead assertion, the first matched part is consumed, so overlapping matches (`aba` in this case) are excluded from the result. – falsetru May 14 '17 at 03:16
  • @falsetru Thank you. Got it. – Sumit May 14 '17 at 03:19
  • @falsetru Great answer. I wouldn't have thought of the first look-ahead assertion. Learnt something new today :) – Gurmanjot Singh May 14 '17 at 06:13
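
The point about the outer look-ahead raised in the comments can be checked with a short comparison. This is a sketch only; the variable names `consuming` and `non_consuming` are illustrative:

```python
import re

text = 'aabaacaba'

# Without the outer look-ahead, each match consumes its characters,
# so the next search starts after the previous match.
consuming = re.findall(r'([a-z]{2,})(?=.*\1)', text)

# With the outer look-ahead, every match is zero-width,
# so each start position in the string is tried.
non_consuming = re.findall(r'(?=([a-z]{2,})(?=.*\1))', text)

print(consuming)      # ['aa', 'ba']
print(non_consuming)  # ['aa', 'aba', 'ba']
```

In the consuming version, aa takes the first two characters, so the search never starts at the second a and aba is never found.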