How to catch the longest sequence of a group

Question

The task is to find the longest sequence of a group.

For instance, given the DNA sequence "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC", there are 7 occurrences of AGATC, and the regex (AGATC) matches all of them. Is it possible to write a regular expression that catches only the longest sequence, i.e. AGATCAGATCAGATCAGATCAGATC in the given text? If this is not possible with regex alone, how can I iterate through each sequence (i.e. the 1st sequence is AGATCAGATC, the 2nd is AGATCAGATCAGATCAGATCAGATC, and so on) in Python?

Shubham Sharma · Accepted Answer · 2020-05-29T04:49:48.987

4

Use:

import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
matches = re.findall(r'(?:AGATC)+', sequence)

# To find the longest subsequence
longest = max(matches, key=len)

Explanation:

(?:AGATC)+ is a non-capturing group followed by a quantifier:

(?:AGATC) matches the characters AGATC literally (case sensitive), without creating a capture group.
+ is a greedy quantifier: it matches the group between one and unlimited times, as many times as possible.
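As a side note on why the group is non-capturing: the sketch below compares re.findall with a capturing and a non-capturing group on the sequence from the question. The variable names captured and whole are only illustrative.

```python
import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

# With a capturing group, findall returns the group's content,
# which is only the last repetition captured in each run.
captured = re.findall(r'(AGATC)+', sequence)

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r'(?:AGATC)+', sequence)

print(captured)  # ['AGATC', 'AGATC']
print(whole)     # ['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']
```

With the capturing form, the length of each run is lost, so max(..., key=len) could not tell the runs apart.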

Result:

# print(matches)
['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']

# print(longest)
'AGATCAGATCAGATCAGATCAGATC'
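If the number of repeats matters more than the matched text, a minimal sketch along the same lines could look like this. The names unit, runs and longest_repeats are assumptions made for illustration, and re.escape is used on the assumption that the group is literal text.

```python
import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
unit = "AGATC"  # the repeated group from the question

# Walk the runs in order, recording where each one starts
# and how many consecutive copies of `unit` it contains.
runs = [(m.start(), len(m.group()) // len(unit))
        for m in re.finditer(f"(?:{re.escape(unit)})+", sequence)]

for start, repeats in runs:
    print(f"run at index {start}: {repeats} repeats")

# Largest repeat count over all runs (0 if there are none).
longest_repeats = max((repeats for _, repeats in runs), default=0)
print(longest_repeats)  # 5
```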


edited May 29 '20 at 04:49

answered May 29 '20 at 04:43

Shubham Sharma


Such a short and elegant solution! Thank you a lot – hamvee May 29 '20 at 05:26
This is a good solution, probably the one I would use, but it does not answer the question, "Is it possible to write a regular expression that catches only the longest sequence?". – Cary Swoveland May 29 '20 at 17:37
@CarySwoveland Well, I took an alternate approach to the question. It doesn't find the single longest subsequence using regex only, but it does find all the consecutive subsequences at their longest possible length, so we can easily select the longest one using just the max function. – Shubham Sharma May 29 '20 at 17:46

Cary Swoveland · Answer 2 · 2020-06-30T03:20:12.750

The central question is, "Is it possible to write a regular expression that catches only the longest sequence?" The answer is "yes":

import re

s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'

m = re.search(r'((?:AGATC)+)(?!.*\1)', s)
print(m.group() if m else '')
  #=> "AGATCAGATCAGATC"

Regex demo · Python demo

Python's regex engine performs the following operations.

(            begin capture group 1
  (?:AGATC)  match 'AGATC' in a non-capture group
  +          execute the non-capture group 1+ times
)            end capture group 1
(?!          begin a negative lookahead
  .*         match 0+ characters
  \1         match the content of capture group 1
)            end the negative lookahead
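The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows # comments inside the pattern. This is only a sketch, and the name LONGEST_RUN is made up for the example.

```python
import re

# Same pattern as in the answer, annotated in place via re.VERBOSE.
LONGEST_RUN = re.compile(r"""
    (              # begin capture group 1
      (?:AGATC)+   # one or more consecutive copies of 'AGATC'
    )              # end capture group 1
    (?!            # begin negative lookahead
      .*           # any characters (except newline) ...
      \1           # ... followed by another copy of group 1
    )              # end negative lookahead
""", re.VERBOSE)

s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'
m = LONGEST_RUN.search(s)
print(m.group() if m else '')  # AGATCAGATCAGATC
```

Because . does not match a newline by default, input that spans several lines would need the re.DOTALL flag as well for the lookahead to see the rest of the text.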

For the string s above, AGATC would first be matched but the negative lookahead would find AGATC as the first part of AGATCAGATC, so the tentative match would be rejected. Then AGATCAGATC would be matched, but the negative lookahead would find AGATCAGATC as the first part of AGATCAGATCAGATC so that tentative match would also be rejected. Next, AGATCAGATCAGATC would be matched and accepted, as the negative lookahead would not find that match later in the string. (re.findall, unlike re.search, would also match AGATCAGATC at the end of the string.)

If re.findall were used, there may be multiple matches after the longest one (see the last test string at the link to the regex demo), but the lengths of the matches are non-increasing from the first to the last, because no match can reappear later in the string. Therefore the first match, obtained using re.search, is a longest match.
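A quick way to check this behaviour on the string s is to run both functions side by side and compare the result with the findall-plus-max approach from the accepted answer. The variable names here are illustrative only.

```python
import re

s = 'AGATC_AGATCAGATC_AGATCAGATCAGATC_AGATC_AGATCAGATC'
pattern = r'((?:AGATC)+)(?!.*\1)'

# findall returns the content of group 1 for every match, in order.
all_matches = re.findall(pattern, s)
print(all_matches)  # ['AGATCAGATCAGATC', 'AGATCAGATC']

# search stops at the first match, which is the longest run.
first = re.search(pattern, s).group()

# Cross-check against the simpler "find all runs, take the max" approach.
via_max = max(re.findall(r'(?:AGATC)+', s), key=len)
print(first == via_max)  # True
```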

Good answer @caryswoveland +1. – Shubham Sharma May 29 '20 at 17:48

score 1 · Answer 3 · answered May 29 '20 at 04:48

Use re.finditer() to iterate over all matches. Then use max() with a key function to find the longest. Make it a function so you can use different groups.

import re

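def find_longest(sequence, group):
    """Return the longest run of consecutive `group` repeats in `sequence`."""
    # build pattern; `group` is inserted as-is, so it may itself be a regex
    pattern = fr"(?:{group})+"

    # iterate over all matches
    matches = (match[0] for match in re.finditer(pattern, sequence))

    # find the longest; return "" instead of raising ValueError when nothing matches
    return max(matches, key=len, default="")

seq = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"

print(find_longest(seq, "AGATC"))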
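If the position of the longest run is needed as well, one possible variation keeps the match objects instead of the matched strings. The function name find_longest_span is hypothetical, and re.escape is used on the assumption that group is literal text rather than a regex.

```python
import re

def find_longest_span(sequence, group):
    """Return (start, end, text) of the longest run of `group`, or None."""
    # re.escape: treat `group` as literal text (an assumption of this sketch)
    pattern = f"(?:{re.escape(group)})+"
    best = max(re.finditer(pattern, sequence),
               key=lambda m: m.end() - m.start(),
               default=None)
    return None if best is None else (best.start(), best.end(), best.group())

print(find_longest_span("AGATC_AGATCAGATC", "AGATC"))  # (6, 16, 'AGATCAGATC')
print(find_longest_span("TTTT", "AGATC"))              # None
```

When several runs share the maximum length, max returns the first one it encounters, so the earliest run wins.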
