I've searched Google for my use case but didn't find anything useful.
I am not an expert in regular expressions, so I would appreciate any help from the community.
Question:
Given a text file, I want to capture the longest string between two substrings (a prefix and a suffix) using regex. Note that both substrings must appear at the start of a line of the text. Please see the examples below.
Substrings:
prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']
Example 1:
Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2 ....
Item 1 ....
Item 2
Item 1a ....
....
....
....
....
Item 2b ....
Expected result:
Item 1a ....
....
....
....
....
Why this result?
Because the prefix Item 1a and the suffix Item 2b enclose the longest string in the text among all prefix-suffix pairs.
Example 2:
Item 1 ....
Item 2 ....
Item 1 ....
....
....
Item 2
.... Item 1 ....
Item 2
Item 1a .... ....
....
....
.... Item 2b
....
Expected result:
Item 1 ....
....
....
Why this result?
Because this is the longest string between a prefix-suffix pair where both the prefix and the suffix start at the beginning of a line. Note that there is another pair (Item 1a - Item 2b), but since Item 2b does not come at the beginning of a line, that longer sequence cannot be considered.
What I have tried with regex:
I tried the regex below for each prefix-suffix pair from the lists above, but it didn't work.
import re

regexs = [r'^' + re.escape(pre) + '(.*?)' + re.escape(suf) for pre in prefixes for suf in suffixes]
for regex in regexs:
    re.findall(regex, text, re.MULTILINE)
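One likely reason the pattern above never spans several lines is that `.` does not match newlines unless `re.DOTALL` is set, and the suffix is not anchored to the start of a line. Below is a minimal sketch of a variant that addresses both points. It assumes `\n` line endings, that a block ends at the nearest following suffix line, and that the suffix line itself is excluded from the result (as in the expected results above). The function name `longest_between` is hypothetical.

```python
import re

prefixes = ['Item 1', 'Item 1a', 'Item 1b']
suffixes = ['Item 2', 'Item 2a', 'Item 2b']

def longest_between(text, prefixes, suffixes):
    """Return the longest block that starts on a prefix line and ends just
    before the nearest following suffix line ('' if there is none)."""
    pre = "|".join(re.escape(p) for p in prefixes)
    suf = "|".join(re.escape(s) for s in suffixes)
    # MULTILINE: ^ matches at the start of every line.
    # DOTALL:    . also matches newlines, so .*? can cross lines.
    # The lookahead requires a suffix at a line start without consuming it.
    pattern = re.compile(
        rf"^(?:{pre}).*?(?=^(?:{suf}))",
        re.MULTILINE | re.DOTALL,
    )
    blocks = [m.group(0).rstrip("\n") for m in pattern.finditer(text)]
    return max(blocks, key=len, default="")
```

Combining all prefixes and all suffixes into two alternations avoids looping over every prefix-suffix pair, and `finditer` only returns non-overlapping matches, so each block is reported once.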
What I have tried using non-regex (Python string functions):
def extract_longest_match(text, prefixes, suffixes):
    longest_match = ''
    for line in text.splitlines():
        if line.startswith(tuple(prefixes)):
            beg_index = text.index(line)
            for suf in suffixes:
                end_index = text.find(suf, beg_index+len(line))
                match = text[beg_index:end_index]
                if len(match) > len(longest_match):
                    longest_match = match
    return longest_match
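For comparison, here is a line-by-line sketch that checks both the prefix and the suffix with `startswith`, so a suffix in the middle of a line is ignored. The function above instead searches for the suffix anywhere in the text with `text.find`, and `text.index(line)` always returns the first occurrence of a repeated line. The sketch assumes a block ends at the nearest following suffix line and that a block with no closing suffix is discarded. The name `extract_longest_block` is hypothetical.

```python
def extract_longest_block(text, prefixes, suffixes):
    """Return the longest run of lines that starts on a prefix line and
    ends just before the next line that starts with a suffix."""
    prefixes, suffixes = tuple(prefixes), tuple(suffixes)
    longest = []
    current = None  # None while we are outside any block
    for line in text.splitlines():
        if current is not None and line.startswith(suffixes):
            # A suffix at a line start closes the open block.
            if len("\n".join(current)) > len("\n".join(longest)):
                longest = current
            current = None
        elif current is None and line.startswith(prefixes):
            # A prefix at a line start opens a new block.
            current = [line]
        elif current is not None:
            # Any other line inside an open block belongs to it.
            current.append(line)
    return "\n".join(longest)
```

Blocks are compared by character length here; comparing `len(current)` with `len(longest)` instead would pick the block with the most lines.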
Am I missing anything?