I'm using regular expressions to find the maximum number of consecutive repeats of a substring in a given string. In the example below, there are 9 consecutive AAGAA
substrings. The first method returns the number of repeats in each separate run of consecutive substrings, and the second returns the overall maximum. Therefore, max(lens)
should be equal to val.
However, the method that computes val finds a match with 10 repeats of AAGAA, even though the original string contains a maximum of only 9.
I've spent a lot of time looking at regex tutorials and regex101.com, but I can't figure this out. Where is "(?=((" + re.escape(substring) + ")+))" finding an extra substring?
import re

string = 'AAGAAAAAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAA'
substring = 'AAGAA'

# this one is right; returns [1, 1, 9], as desired
sl = len(substring)
regex = re.compile(f'((?:{re.escape(substring)})+)')
lens = [len(m) // sl for m in regex.findall(string)]

# this one is wrong; returns 10, should return 9
pattern = re.compile("(?=((" + re.escape(substring) + ")+))")
matches = pattern.findall(string)
val = max(len(m[0]) // sl for m in matches)
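One way to narrow this down might be to print where each run starts, not just how long it is. The sketch below is only a diagnostic under my own assumptions: runs_consuming and runs_lookahead are hypothetical helper names (not part of re), and they wrap the same two patterns with re.finditer so that each run's start offset is visible. The first pattern consumes the characters it matches, so the next search resumes after them; the lookahead pattern is zero-width, so it is tried at every position. If the longest lookahead run begins inside a span that an earlier consuming match has already used up, that could account for the difference.

```python
import re

string = 'AAGAAAAAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAA'
substring = 'AAGAA'

def runs_consuming(text, sub):
    """Hypothetical helper: (start, repeats) for each non-overlapping run.

    Each match consumes its characters, so the scan resumes after it.
    """
    regex = re.compile(f'(?:{re.escape(sub)})+')
    return [(m.start(), len(m.group()) // len(sub)) for m in regex.finditer(text)]

def runs_lookahead(text, sub):
    """Hypothetical helper: (start, repeats) at every position where a run begins.

    The lookahead is zero-width, so matches may overlap one another.
    """
    pattern = re.compile(f'(?=((?:{re.escape(sub)})+))')
    return [(m.start(), len(m.group(1)) // len(sub)) for m in pattern.finditer(text)]

consuming = runs_consuming(string, substring)
lookahead = runs_lookahead(string, substring)

# Longest run seen by the lookahead pattern, with its start offset.
best_start, best_repeats = max(lookahead, key=lambda r: r[1])

# Does that run start inside a span the consuming pattern already matched?
swallowed = any(
    start < best_start < start + repeats * len(substring)
    for start, repeats in consuming
)

print(consuming)                 # repeat counts should be 1, 1, 9 as reported above
print(best_start, best_repeats)  # repeat count should be 10 as reported above
print(swallowed)
```

If the last line prints True, the 10-repeat run would really be present in the string, and the first method would simply be unable to start a match at that offset.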