TASK
Find the longest consecutive run of a substring within a string.
For example, given the DNA sequence AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC, find the longest consecutive run of AGATC.
WHAT I TRIED
Using search = re.findall(r'(AGATC)+', DNAsequence) only returns single copies of the pattern, e.g. ['AGATC', 'AGATC', 'AGATC', ...].
I found that I needed the non-capturing group (?:...) to get the expected output: search = re.findall(r'(?:AGATC)+', DNAsequence) returns the expected ['AGATC', 'AGATCAGATCAGATC', 'AGATC', ...].
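For reference, here is a minimal reproduction of the two calls. It uses a shortened, made-up sequence (a run of 2 copies, a spacer, then a run of 3 copies) rather than the full sequence from the task, and the variable names are only illustrative.

```python
import re

# Hypothetical shortened sequence, not the one from the task:
# two copies of AGATC, a spacer, then three copies.
dna = "AGATC" * 2 + "TTTT" + "AGATC" * 3

# Capturing group: findall returns what the group captured.
capturing = re.findall(r'(AGATC)+', dna)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?:AGATC)+', dna)

print(capturing)      # ['AGATC', 'AGATC']
print(non_capturing)  # ['AGATCAGATC', 'AGATCAGATCAGATC']

# Longest run, expressed as a number of repeats.
longest = max(non_capturing, key=len, default="")
repeats = len(longest) // len("AGATC")
print(repeats)        # 3
```

With the non-capturing version, taking the longest element and dividing its length by the pattern length gives the repeat count the task asks for.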
WHAT I NEED HELP WITH
Why do I need the non-capturing group to get matches with more than one consecutive repeat? Shouldn't (AGATC)+ on its own already give the expected output? As I understand it, whether a group is capturing or not shouldn't affect what the search matches.
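A quick check with re.finditer suggests the matched spans really are the same in both cases, so the difference seems to lie in what re.findall returns rather than in what gets matched. As far as I can tell from the re documentation, findall returns the group's text instead of the whole match when the pattern contains exactly one group, and a repeated group keeps only its last repetition. The sketch below uses the same made-up shortened sequence, not the full one from the task.

```python
import re

# Hypothetical shortened sequence: a run of 2, a spacer, a run of 3.
dna = "AGATC" * 2 + "TTTT" + "AGATC" * 3

# group(0) is the whole match; group(1) is the last repetition
# captured by the repeated group.
for m in re.finditer(r'(AGATC)+', dna):
    print(m.span(), m.group(0), m.group(1))

# The matched spans are identical with or without capturing.
spans_capturing = [m.span() for m in re.finditer(r'(AGATC)+', dna)]
spans_non_capturing = [m.span() for m in re.finditer(r'(?:AGATC)+', dna)]
print(spans_capturing == spans_non_capturing)  # True

# Whole matches can still be recovered from the capturing pattern.
whole_matches = [m.group(0) for m in re.finditer(r'(AGATC)+', dna)]
print(whole_matches)  # ['AGATCAGATC', 'AGATCAGATCAGATC']
```

If that reading is right, (AGATC)+ does find the long runs, and re.finditer with group(0) would be an alternative to switching to a non-capturing group.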
NOTE
This question is closely related to "How to catch the longest sequence of a group", but the top answer there doesn't explain why the non-capturing group is needed. I am unable to add my question as a comment, so I have created a new post.