Finding a given substring in a string with Regular Expression in Python

Question

I am trying to find all the occurrences of a substring in a string like below:

import re
S = 'aaadaa'
matches = re.finditer('(aa)', S)
if matches:
    #print(matches)
    for match in matches:
        print(match)
else:
    print("No match")

The current output is:

<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(4, 6), match='aa'>

But I expected it to give:

<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(1, 3), match='aa'>
<re.Match object; span=(4, 6), match='aa'>

Could someone please help me with this?

As far as I can tell, the `find*` functions just aren’t designed to return overlapping matches. It does look like [this](https://pypi.org/project/regex/) alternative RegEx library supports the feature. — AMC, Dec 02 '19 at 04:35
Does this answer your question? [Python regex find all overlapping matches?](https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches) — AMC, Dec 02 '19 at 04:38
Check this post : https://stackoverflow.com/questions/4664850/how-to-find-all-occurrences-of-a-substring — furkanayd, Dec 02 '19 at 04:45
@furkanayd Ah nice, it’s a more popular question than the one I found. — AMC, Dec 02 '19 at 04:47
Does this answer your question? [How to find all occurrences of a substring?](https://stackoverflow.com/questions/4664850/how-to-find-all-occurrences-of-a-substring) — David Maze, Dec 02 '19 at 05:31
@furkanayd, Thanks and that works good, but the start and end indexes are printing to be same. Any thoughts on this? — Mathan, Dec 02 '19 at 05:42
@Mathan it is because of the search text length I assume, try with len 3 string and corresponding main string such as "aaaadaaa" with "aaa", it will result with start = end - 1. — furkanayd, Dec 02 '19 at 05:46
@furkanayd It isn't because of the text's length; it's because the pattern's outer level is a zero-width lookahead, so the overall match is empty. — AMC, Dec 02 '19 at 21:19

AMC · Answer 1 · 2019-12-02T05:40:48.997

1

Taken from the answer I linked in the comments, here is the pattern you need: (?=(aa)).

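You’ll have to access the matched substring using match_obj.group(1), and the match indices using match_obj.span(1). The lookahead itself consumes no characters, so match_obj.span() without an argument has equal start and end.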
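As a minimal sketch of the above, applied to the string from the question: the variable name `results` is only illustrative. It collects group 1 and its span for each lookahead match. The spans follow Python's slice convention (end is exclusive), so subtract one from the end if an inclusive index is wanted.

```python
import re

S = 'aaadaa'

# The lookahead (?=...) matches at a position without consuming text,
# so the scanner can find a match starting at every index.
# Group 1 inside the lookahead holds the actual substring.
results = [(m.group(1), m.span(1)) for m in re.finditer(r'(?=(aa))', S)]

for text, (start, end) in results:
    # end is exclusive (slice style); end - 1 is the last matched index
    print(text, (start, end), (start, end - 1))
```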

edited Dec 02 '19 at 05:40

answered Dec 02 '19 at 04:42


Yes, that works great. But when I tried to print the starting and ending indexes of those matches, it doesn't work as expected. It gives both starting and ending indexes as same like below: (0, 0) (1, 1) (4, 4) – Mathan Dec 02 '19 at 05:35
Thanks, it prints the ending index increased by 1 as below as it uses the string slice mechanism (0, 2) (1, 3) (4, 6). But I should get (0, 1) (1, 2) (4, 5). – Mathan Dec 02 '19 at 06:02
@Mathan I’m not sure what you mean. That’s just the way the match object counts indices. They’re the same indices as the ones in your post. – AMC Dec 02 '19 at 06:15
@Mathan Did you figure it out in the end? – AMC Dec 02 '19 at 21:20

score 0 · Answer 2 · answered Dec 02 '19 at 04:38

The problem is that once the re module matches aa, it consumes both letters, so the next search starts after them. But you want overlapping matches. One trick is to search for a(?=a), which matches a single a only when another a follows it:

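import re

S = 'aaadaa'
# Match an 'a' only when another 'a' follows; the lookahead consumes nothing.
matches = re.findall(r'a(?=a)', S)
# Re-attach the second 'a', which the lookahead checked but did not return.
matches = [s + "a" for s in matches]
print(matches)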

['aa', 'aa', 'aa']

Note that we tag on the second a to the output list, since only the first letter is actually matched at each step.
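The same idea can be generalized so that nothing has to be appended by hand. The following is only a sketch, not part of the original answer: the helper name `overlapping_spans` is hypothetical. It wraps the whole search text in a lookahead and derives each end index from the length of the text.

```python
import re

def overlapping_spans(needle, haystack):
    """Return (start, end) for every occurrence of needle, overlaps included.

    Hypothetical helper for illustration; end is exclusive, as in slicing.
    """
    # re.escape keeps characters such as '.' or '*' in needle literal.
    pattern = re.compile(f"(?={re.escape(needle)})")
    # Each lookahead match is zero-width, so only its start is meaningful.
    return [(m.start(), m.start() + len(needle))
            for m in pattern.finditer(haystack)]

print(overlapping_spans("aa", "aaadaa"))
```

For the string in the question this gives the three spans the asker expected.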
