Regex for longest matching sequence between two strings

Question

I've searched Google for my use-case but didn't find anything much useful.

I am not an expert in regular expression so I would appreciate if anybody in the community could help.

Question:

Given a text file, I want to capture the longest string between two substrings (prefix and suffix) using regex. Note that those two substrings will always be at the start of any lines of the text. Please see the below example.

Substrings:

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']

Example 1:

Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ....

Expected Result:

Item 1a ....
....
....
....
....

Why this result?

Because prefix of Item 1a and suffix of Item 2b matches the longest string in the text between them of all other prefix-suffix pair.

Example 2:

Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
.... Item 1 ....
Item 2
Item 1a .... ....
....
....
.... Item 2b
....

Expected result:

Item 1 ....
....
....

Why this result?

This is because this is the largest string between two strings (prefix and suffix pair) where both prefix and suffix starts at the beginning of the line. Note that there's another pair (Item 1a-Item 2b) but since Item 2b does not comes at the beginning of the line, we cannot consider this longest sequence.

What I have tried with regex:

I have tried with below regex for each prefix-suffix pair in my above list, but this didn't work.

regexs = [r'^' + re.escape(pre) + '(.*?)' + re.escape(suf) for pre in prefixes for suf in suffixes]
for regex in regexs:
    re.findall(regex, text, re.MULTLINE)

What I have tried using non-regex (Python string functions):

def extract_longest_match(text, prefixes, suffixes):
    longest_match = ''
    for line in text.splitlines():
        if line.startswith(tuple(prefixes)):
            beg_index = text.index(line)
            for suf in suffixes:
                end_index = text.find(suf, beg_index+len(line))
                match = text[beg_index:end_index]
                if len(match) > len(longest_match ):
                    longest_match = match
    return longest_match

Am I missing anything?

Regex can't match the longest substring. Use the regex or a non-regex solution to find the substrings and find the longest using the common language means. — Wiktor Stribiżew, Nov 19 '18 at 21:06
Yes, that is a possibility to check length of string returned with regex. But do you know how to get text between prefix and suffix where both prefix and suffix start at the left side of sentence? Is my regex correct? — Saurabh Gokhale, Nov 19 '18 at 21:07
@WiktorStribiżew Regex can match the longest substring. The problem is that with `(.*?)` OP is explicitly using a non-greedy regex. Should probably only be `(.*)` instead of `(.*?)`. — quant, Nov 19 '18 at 21:07
@quant That `.*` won't yield the longest substring, that is just matching from the leftmost occurrence of the leading delimiter till the rightmost occurrence of trailing delimiter. And that is not the same. It is a common confusion of greediness and longest/shortest substring extraction. — Wiktor Stribiżew, Nov 19 '18 at 21:09
@WiktorStribiżew What am I missing here then? What's the correct regex to match string between prefix and suffix. I can then later check length and maintain largest string in a variable. — Saurabh Gokhale, Nov 19 '18 at 21:19
Surely you miss `re.DOTALL`. You do not need `re.M` flag. And probably you want to match overlapping matches. — Wiktor Stribiżew, Nov 19 '18 at 21:22
See [Python regex find all overlapping matches?](https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches) — Wiktor Stribiżew, Nov 19 '18 at 21:30
@WiktorStribiżew Thanks. I was trying with non-regex solution (using plain string functions) in Python. Please see the edited portion of questions. It is still not working correctly. Am I missing something there? — Saurabh Gokhale, Nov 19 '18 at 21:56

Wiktor Stribiżew · Accepted Answer · 2018-11-19T22:22:20.850

You need to

Build a regex that matches strings from the leftmost starting delimiter to the leftmost trailing delimiter (see Match text between two strings with regular expression)
Make sure the delimiters are matches at the line start positions only
Make sure the . matches the line break chars by using re.DOTALL or equivalent options (see Python regex, matching pattern over multiple lines)
Make sure the regex matches overlapping substrings (see Python regex find all overlapping matches)
Find all matches in the text (see How can I find all matches to a regular expression in Python?)
Get the longest one (see Python's most efficient way to choose longest string in list?).

Python demo:

import re
s="""Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ...."""
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
rx = r"(?=^((?:{}).*?^(?:{})))".format("|".join(prefixes), "|".join(suffixes))
# Or, a version with word boundaries:
# rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))
all_matches = re.findall(rx, s, re.S | re.M)
print(max(all_matches, key=len))

Output:

Item 1a ....
....
....
....
....
Item 2

The regex looks like

(?sm)(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)))

With word boundaries

(?sm)(?=^((?:Item 1|Item 1a|Item 1b)\b.*?^(?:Item 2|Item 2a|Item 2b)\b))

See the regex demo.

Details

(?sm) - re.S and re.M flags
(?=^((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b))) - a positive lookahead that matches at any location that is immediately followed with a sequence of patterns:
- ^ - start of a line
- ((?:Item 1|Item 1a|Item 1b).*?^(?:Item 2|Item 2a|Item 2b)) - Group 1 (this value is returned with re.findall)
- (?:Item 1|Item 1a|Item 1b) - any of the items in the alternation (probably, it makes sense to add \b word boundary after ) here)
- .*? - any 0+ chars, as few as possible
- ^ - start of a line
- (?:Item 2|Item 2a|Item 2b) - any alternative from the list (probably, it also makes sense to add \b word boundary after ) here).

Thank you for your quick solution. One small issueI see on my test example (not shown here) is it also matches any suffixes which comes anywhere in the string. I want to check for such prefix-suffix pair where both prefix and suffix comes at the beginning of the line. Sorry if my example in the question mislead you to believe otherwise. :-) — Saurabh Gokhale, Nov 19 '18 at 22:05
I've edited my question with example 2. So maybe that example 2 will help to explain my use-case better :-) — Saurabh Gokhale, Nov 19 '18 at 22:10
@sgokhales Perhaps, you actually want `rx = r"(?=^((?:{})\b.*?^(?:{})\b))".format("|".join(prefixes), "|".join(suffixes))` to match the delimiters as whole words. See this [**Python demo**](https://ideone.com/52Xumx). — Wiktor Stribiżew, Nov 19 '18 at 22:15
With your above regex in the comments, it throws me an error: max() arg is an empty sequence. Perhaps, it didn't find any match. — Saurabh Gokhale, Nov 19 '18 at 22:17
@sgokhales Sorry, forgot to use a raw string literal. Comment above is updated. — Wiktor Stribiżew, Nov 19 '18 at 22:19
Also, thank you for all the reference links. It will definitely help to learn some regex'ing. — Saurabh Gokhale, Nov 19 '18 at 22:20

Regex for longest matching sequence between two strings

1 Answers1