I would like to find all the strings that appear between an element of the list start_signs and an element of the list end_signs. When the element of end_signs is missing, or only appears out of context later, the match should not be taken.
One solution would be to take all the matches between start_signs and end_signs and check whether the matches contain only words from a third list, allowed_words_between.
import re
allowed_words_between = ["and","with","a","very","beautiful"]
start_signs = ["$","$$"]
end_signs = ["Ferrari","BMW","Lamborghini","ship"]
teststring = """
I would like to be a $-millionaire with a Ferrari. -> Match: $-millionaire with a Ferrari
I would like to be a $$-millionair with a Lamborghini. -> Match: $$-millionair with a Lamborghini
I would like to be a $$-millionair with a rotten Lamborghini. -> No Match because of the word "rotten"
I would like to be a $$-millionair with a Lamborghini and a Ferrari. -> Match: $$-millionair with a Lamborghini and a Ferrari
I would like to be a $-millionaire with a very, very beautiful ship! -> Match: $-millionaire with a very, very beautiful ship
I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship. -> No Match because of the word dirty
I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great. -> No Match
"""
Another solution would be to start the string at one of the start_signs and cut it as soon as a word appears that is not in an allowed list:
allowed_list = allowed_words_between + start_signs + end_signs
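A rough sketch of this second idea, walking over whitespace-separated tokens. The assumptions here are mine: the token that begins with a start sign is accepted as a whole (so "$-millionaire" counts as the start), trailing punctuation is ignored when a word is compared with the lists, and a match only counts if at least one end sign was reached before the first disallowed word. `scan` and `PUNCT` are hypothetical names.

```python
import re

# Lists copied from the question.
allowed_words_between = ["and", "with", "a", "very", "beautiful"]
start_signs = ["$", "$$"]
end_signs = ["Ferrari", "BMW", "Lamborghini", "ship"]

# Punctuation stripped from a token before comparing it (assumption).
PUNCT = ".,;:!?\"'"


def scan(line):
    """Start at a start sign and cut at the first word that is not allowed."""
    allowed = set(allowed_words_between) | set(end_signs)
    ends = set(end_signs)
    tokens = list(re.finditer(r"\S+", line))
    found = []
    i = 0
    while i < len(tokens):
        if not tokens[i].group().startswith(tuple(start_signs)):
            i += 1
            continue
        begin, stop = tokens[i].start(), None
        j = i + 1
        while j < len(tokens):
            raw = tokens[j].group()
            word = raw.strip(PUNCT)
            if word not in allowed:
                break  # first disallowed word: cut here
            if word in ends:
                # Remember the position right after the latest end sign.
                stop = tokens[j].start() + raw.find(word) + len(word)
            j += 1
        if stop is not None:
            found.append(line[begin:stop])
        i = j  # resume at the token that stopped the walk
    return found


print(scan("I would like to be a $$-millionair with a Lamborghini and a Ferrari."))
print(scan("I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship."))
```

Slicing the original line instead of re-joining the tokens keeps punctuation such as the comma in "very, very" intact.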
What I tried so far:
I used the solution of this post and tried to create a regex string that is variable w.r.t. start and end. That is not working. I also don't know how the content check could work.
regexString = "("+"|".join(start_signs) + ")" + ".*?" + "(" +"|".join(end_signs)+")"
matches = re.findall(regexString,teststring)
substituted_text = re.sub(regexString, "[[Found It]]", teststring, count=0)
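One likely reason the pattern built from start_signs finds nothing is that `$` is a regex metacharacter (an end-of-string anchor), so the start signs would need to go through `re.escape`. A hedged sketch of how the content check might be folded into the same pattern is given here. It assumes that the word glued to the start sign belongs to the start, that words are separated by whitespace or commas, and that end signs may also appear in the middle as long as the match finishes on one. `alternation`, `sep` and `pattern` are illustrative names.

```python
import re

# Lists copied from the question.
allowed_words_between = ["and", "with", "a", "very", "beautiful"]
start_signs = ["$", "$$"]
end_signs = ["Ferrari", "BMW", "Lamborghini", "ship"]


def alternation(items):
    # Escape metacharacters and put longer items first ("$$" before "$").
    return "|".join(re.escape(s) for s in sorted(items, key=len, reverse=True))


start = alternation(start_signs)
end = alternation(end_signs)
filler = alternation(allowed_words_between + end_signs)
sep = r"[\s,]+"  # separators between words (assumption: whitespace or commas)

# start sign + glued word, any run of allowed words, then a closing end sign.
pattern = re.compile(
    rf"(?:{start})\S*(?:{sep}(?:{filler})\b)*{sep}(?:{end})\b"
)

print(pattern.findall("I would like to be a $$-millionair with a Lamborghini and a Ferrari."))
print(pattern.findall("I would like to be a $$-millionair with a rotten Lamborghini."))
```

All groups are non-capturing, so `re.findall` returns the whole matched substrings. Since `\s` also matches newlines, a match could in principle run across a line break.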