-2

I am trying to find all occurrences of a substring in a string, including overlapping ones, as shown below:

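import re

S = 'aaadaa'
# finditer returns an iterator, which is always truthy,
# so collect the matches into a list before testing for emptiness
matches = list(re.finditer('(aa)', S))
if matches:
    for match in matches:
        print(match)
else:
    print("No match")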

The current output is:

<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(4, 6), match='aa'> 

But I expected it to also return the overlapping match:

<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(1, 3), match='aa'>
<re.Match object; span=(4, 6), match='aa'>

Could someone please help me with this?

Mathan
  • As far as I can tell, the `find*` functions just aren’t designed to return overlapping matches. It does look like [this](https://pypi.org/project/regex/) alternative RegEx library supports the feature. – AMC Dec 02 '19 at 04:35
  • Does this answer your question? [Python regex find all overlapping matches?](https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches) – AMC Dec 02 '19 at 04:38
  • Check this post : https://stackoverflow.com/questions/4664850/how-to-find-all-occurrences-of-a-substring – furkanayd Dec 02 '19 at 04:45
  • @furkanayd Ah nice, it’s a more popular question than the one I found. – AMC Dec 02 '19 at 04:47
  • Does this answer your question? [How to find all occurrences of a substring?](https://stackoverflow.com/questions/4664850/how-to-find-all-occurrences-of-a-substring) – David Maze Dec 02 '19 at 05:31
  • @furkanayd, Thanks, that works well, but the start and end indexes are printed as the same value. Any thoughts on this? – Mathan Dec 02 '19 at 05:42
  • @Mathan it is because of the search text length I assume, try with len 3 string and corresponding main string such as "aaaadaaa" with "aaa", it will result with start = end - 1. – furkanayd Dec 02 '19 at 05:46
  • @furkanayd It isn't because of the text's length, it's the fact that the pattern's outer level is a non-capturing group. – AMC Dec 02 '19 at 21:19

2 Answers

1

Taken from the answer I linked in the comments, here is the pattern you need: (?=(aa)).

The lookahead itself matches an empty string, so the overall match has zero width. You’ll have to access the matched substring using match_obj.group(1), and the match indices using match_obj.span(1).
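A minimal sketch of how that pattern might be used with the string from the question. The variable names `overlapping` and `inclusive` are illustrative only; the inclusive-end conversion is an assumption about what is wanted when the end index should point at the last matched character.

```python
import re

S = 'aaadaa'

# (?=(aa)) is a zero-width lookahead, so m.span() is empty, e.g. (0, 0).
# The text and its real position live in capture group 1.
overlapping = [(m.group(1), m.span(1)) for m in re.finditer(r'(?=(aa))', S)]

for text, span in overlapping:
    print(text, span)

# span(1) uses slice-style indices (end is exclusive).
# Subtract 1 from the end if an inclusive end index is needed.
inclusive = [(start, end - 1) for _, (start, end) in overlapping]
print(inclusive)
```

This prints the three matches with spans (0, 2), (1, 3) and (4, 6), followed by the inclusive pairs (0, 1), (1, 2) and (4, 5).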

AMC
  • Yes, that works great. But when I tried to print the starting and ending indexes of those matches, it doesn't work as expected. It gives both starting and ending indexes as same like below: (0, 0) (1, 1) (4, 4) – Mathan Dec 02 '19 at 05:35
  • Thanks, it prints the ending index increased by 1 as below as it uses the string slice mechanism (0, 2) (1, 3) (4, 6). But I should get (0, 1) (1, 2) (4, 5). – Mathan Dec 02 '19 at 06:02
  • @Mathan I’m not sure what you mean. That’s just the way the match object counts indices. They’re the same indices as the ones in your post. – AMC Dec 02 '19 at 06:15
  • @Mathan Did you figure it out in the end? – AMC Dec 02 '19 at 21:20
0

The problem here is that once the re module matches aa, it consumes both letters, so the next search starts after them. But you want overlapping matches. One trick you could use here is to search for a(?=a), which consumes only the first a and uses a lookahead to check that a second a follows:

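import re

S = 'aaadaa'
# Match an 'a' only when another 'a' follows; the lookahead consumes nothing
matches = re.findall(r'a(?=a)', S)
# Append the second 'a', which was checked but not captured
matches = [s + "a" for s in matches]
print(matches)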

['aa', 'aa', 'aa']

Note that we tag on the second a to the output list, since only the first letter is actually matched at each step.
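Since findall returns only the matched text, the positions are lost. If the indices are also needed, the same idea could be sketched with finditer as below. The variable `sub` and the way the pattern is built from it are illustrative assumptions, not part of the approach above, and the sketch assumes `sub` has at least two characters.

```python
import re

S = 'aaadaa'
sub = 'aa'  # hypothetical name for the substring being searched for

# Consume only the first character and look ahead for the rest,
# which for 'aa' yields the pattern a(?=a).
pattern = re.escape(sub[0]) + '(?=' + re.escape(sub[1:]) + ')'

# Each match covers one character, so the full span is start .. start + len(sub).
spans = [(m.start(), m.start() + len(sub)) for m in re.finditer(pattern, S)]
print(spans)
```

For the string in the question this gives [(0, 2), (1, 3), (4, 6)].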

Tim Biegeleisen