
TASK

Find the longest run of consecutive repeats of a substring within a string.

For example, given the DNA sequence AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC, find the longest run of consecutive AGATC repeats.

WHAT I TRIED

Using search = re.findall(r'(AGATC)+', DNAsequence) only returns single copies of the substring, never a longer run, e.g. ['AGATC', 'AGATC', 'AGATC', ...]

I found out that I needed the non-capturing group ?: to get the expected output. search = re.findall(r'(?:AGATC)+', DNAsequence) returns the expected ['AGATC', 'AGATCAGATCAGATC', 'AGATC', ...]

WHAT I NEED HELP WITH

Why do I need to use the non-capturing group expression in order to get more than 1 consecutive sequence? Shouldn't (AGATC)+ on its own already give the expected output? From what I understand, using capturing groups or not also shouldn't affect the search result.

NOTE

This question is highly related to How to catch the longest sequence of a group, but the top answer didn't explain why the non-capturing group has to be used in the syntax. I am unable to add my question as a comment, so I have to create a new post.

vieveee

1 Answer


The non-capturing group is necessary because of how re.findall decides what to return: if the pattern contains a capture group, re.findall returns the contents of that group rather than the entire match. Both patterns match exactly the same spans of text, so the search itself is not affected. However, when a capture group is repeated, as in (AGATC)+, the group keeps only its last iteration, so a single AGATC is the longest string that pattern can ever return through re.findall. With the capture group turned off, the pattern has no groups at all, so re.findall defaults to returning the entire match, which is what you want.
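To see that the capturing pattern really does match the whole run, here is a small sketch using re.finditer, which yields match objects instead of strings. The variable name pairs is only illustrative; group(0) is the entire match and group(1) is the capture group.

```python
import re

seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

# Each match object exposes both the entire match and the capture group.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'(AGATC)+', seq)]

for whole, last_group in pairs:
    # whole is the full repeated run; last_group holds only the final repetition
    print(len(whole), last_group)
```

The full run is matched in both cases; re.findall simply reports group 1 instead of group 0 when a capture group is present.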

import re
seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
matches = re.findall(r'(AGATC)+', seq)
print(matches)

This prints:

['AGATC', 'AGATC']

However, turning off the capture group:

seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
matches = re.findall(r'(?:AGATC)+', seq)
print(matches)

This prints:

['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']
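From there, one possible way to get the length of the longest run asked about in the question is to divide the length of each match by the length of the substring. This is a minimal sketch, not the only approach: longest_run is a hypothetical helper name, and it assumes the substring is non-empty.

```python
import re

def longest_run(motif, sequence):
    """Return the repeat count of the longest back-to-back run of motif.

    Hypothetical helper; assumes motif is a non-empty string.
    """
    # re.escape keeps any regex metacharacters in motif literal
    runs = re.findall(r'(?:%s)+' % re.escape(motif), sequence)
    # Each run is a whole number of motif copies; default=0 covers "no match"
    return max((len(run) // len(motif) for run in runs), default=0)

seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
print(longest_run("AGATC", seq))  # prints 5
```

If the longest run itself is wanted rather than its repeat count, max(runs, key=len) on the same list of matches would return it.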
Tim Biegeleisen