-1

I'm using regular expressions to find the maximum number of consecutive repeats of a substring in a given string. In the example below, there are 9 consecutive AAGAA substrings. The first method returns the lengths of all the different stretches of consecutive substrings, and the second returns the overall max. Therefore, max(lens) should be equal to val. However, in the method using val there is a match with 10 repeats of AAGAA, even though the original string contains a maximum of only 9.

I've spent a lot of time looking at regex tutorials and regex101.com but I can't figure this out. Where is "(?=((" + re.escape(substring) + ")+))" finding an extra substring?

string='AAGAAAAAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAA'
substring = 'AAGAA'

#this one is right; returns [1,1,9], as desired
sl = len(substring)    
regex = re.compile(f'((?:{substring})+)')    
lens = [len(m) // sl for m in regex.findall(string)]

#this one is wrong; returns 10, should return 9
pattern = re.compile("(?=((" + re.escape(substring) + ")+))")
matches = re.findall( pattern, string )
val = max(len(m[0]) // len(substring) for m in matches)
ichthyophile
  • 107
  • 8

1 Answers1

1

The reason you are seeing an extra substring is because the regex you are using will find overlapping matches (see this question for an explanation, but essentially the reason it finds overlapping matches is because the regex only contains a lookahead; this doesn't consume any characters) and so it allows the regex to match not only the single occurrence of AAGAA starting at string[9], but also the sequence of 10 occurrences starting at string[13]. Since this latter match partially overlaps the previous one, it does not get matched by your first regex (which actually consumes characters as it matches). The first regex matches the sequence of 9 occurrences starting at string[18] instead because having matched the single occurrence at string[9], the next place it looks for a match is at string[14], at which point it has passed the start of the match of 10 occurrences.

Nick
  • 138,499
  • 22
  • 57
  • 95