The central question is, "Is it possible to write a regular expression that catches only the longest sequence?" The answer is "yes":
import re
s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'
m = re.search(r'((?:AGATC)+)(?!.*\1)', s)
print(m.group() if m else '')
# prints: AGATCAGATCAGATC
Regex demo<¯\(ツ)/¯>Python demo
Python's regex engine performs the following operations.
(            begin capture group 1
  (?:AGATC)  match 'AGATC' in a non-capture group
  +          execute the non-capture group 1+ times
)            end capture group 1
(?!          begin a negative lookahead
  .*         match 0+ characters (other than newlines)
  \1         match the content of capture group 1
)            end the negative lookahead
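The breakdown above can also be written into the pattern itself. The following is a minimal sketch, assuming Python 3 and using re.VERBOSE so that whitespace and comments inside the pattern are ignored; the name LONGEST_RUN is only illustrative.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE
# so that each piece carries its own comment.
LONGEST_RUN = re.compile(r"""
    (               # begin capture group 1
      (?:AGATC)+    # one or more consecutive copies of 'AGATC'
    )               # end capture group 1
    (?!             # begin a negative lookahead
      .*            # any characters except newlines, as many as possible
      \1            # followed by the content of capture group 1
    )               # end the negative lookahead
""", re.VERBOSE)

s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'
m = LONGEST_RUN.search(s)
print(m.group() if m else '')   # prints: AGATCAGATCAGATC
```

Because . does not match a newline by default, the lookahead only inspects the rest of the current line; for multi-line input, adding re.DOTALL would presumably be needed.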
For the string s above, AGATC would first be matched, but the negative lookahead would find AGATC as the first part of AGATCAGATC, so the tentative match would be rejected. Then AGATCAGATC would be matched, but the negative lookahead would find AGATCAGATC as the first part of AGATCAGATCAGATC, so that tentative match would also be rejected. Next, AGATCAGATCAGATC would be matched and accepted, as the negative lookahead would not find that match later in the string. (re.findall, unlike re.search, would also match AGATCAGATC at the end of the string.)
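The walk-through can be checked by printing where each accepted match starts. This is a small sketch, assuming Python 3; the variable names are illustrative.

```python
import re

s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'
pattern = re.compile(r'((?:AGATC)+)(?!.*\1)')

# re.search stops at the first accepted match.
first = pattern.search(s)
print(first.span(), first.group())   # (17, 32) AGATCAGATCAGATC

# finditer keeps scanning after that match and reports every accepted match.
matches = [(m.start(), m.group()) for m in pattern.finditer(s)]
print(matches)   # [(17, 'AGATCAGATCAGATC'), (39, 'AGATCAGATC')]
```

The earlier runs at offsets 0, 6 and 33 do not appear because a copy of each occurs later in the string.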
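If re.findall were used, there may be additional matches after the longest one (see the last test string at the link to the regex demo), but the lengths of the matches are strictly decreasing from the first to the last: a match is accepted only if no copy of it appears later in the string, and any later match of equal or greater length would contain such a copy. Therefore the first match, obtained using re.search, is a longest match.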
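As a sanity check, the result of re.search can be compared with a plain scan that collects every run and takes the longest. This is a sketch assuming Python 3; the helper names and the test strings are hypothetical and are not taken from the linked demo.

```python
import re

def longest_run_regex(text):
    """Longest run via the lookahead pattern; '' if 'AGATC' never occurs."""
    m = re.search(r'((?:AGATC)+)(?!.*\1)', text)
    return m.group() if m else ''

def longest_run_scan(text):
    """Cross-check: collect every maximal run, then pick the longest one."""
    return max(re.findall(r'(?:AGATC)+', text), key=len, default='')

# Hypothetical test strings (single-line, so '.' can see the whole remainder).
samples = [
    'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC',
    'AGATCAGATCAGATC_AGATCAGATC_AGATC',
    'no repeats here',
]
for text in samples:
    # With one capture group, findall returns the group-1 text of each match.
    lengths = [len(run) for run in re.findall(r'((?:AGATC)+)(?!.*\1)', text)]
    print(lengths, longest_run_regex(text) == longest_run_scan(text))
# prints:
# [15, 10] True
# [15, 10, 5] True
# [] True
```

The plain scan is arguably easier to read and does not rely on backtracking through the lookahead; the single-regex form is mainly of interest because it answers the question as posed.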