
I would like to find all the strings that appear between an element of the list start_signs and an element of the list end_signs. When the element of end_signs is missing, or only appears later out of context, there should be no match.

One solution would be to take all the matches between start_signs and end_signs and check whether each match contains only words from a third list, allowed_words_between.

import re

allowed_words_between = ["and","with","a","very","beautiful"]

start_signs           = ["$","$$"]
end_signs             = ["Ferrari","BMW","Lamborghini","ship"]

teststring = """
             I would like to be a $-millionaire with a Ferrari.                                     -> Match: $-millionaire with a Ferrari
             I would like to be a $$-millionair with a Lamborghini.                                 -> Match: $$-millionair with a Lamborghini
             I would like to be a $$-millionair with a rotten Lamborghini.                          -> No Match because of the word "rotten"
             I would like to be a $$-millionair with a Lamborghini and a Ferrari.                   -> Match: $$-millionair with a Lamborghini and a Ferrari
             I would like to be a $-millionaire with a very, very beautiful ship!                   -> Match: $-millionaire with a very, very beautiful ship
             I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship.                       -> No Match because of the word dirty
             I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great.   -> No Match
             """

Another solution would be to start the match at one of the start_signs and cut it off as soon as a word that is not in an allowed list appears:

allowed_list = allowed_words_between + start_signs + end_signs

What I tried so far:

I used the solution of this post

regexString = "("+"|".join(start_signs) + ")" + ".*?" + "(" +"|".join(end_signs)+")" 

and tried to create a regex string that is variable w.r.t. start and end. That is not working. I also don't know how the content check could work.

matches          = re.findall(regexString,teststring)
substituted_text = re.sub(regexString, "[[Found It]]", teststring, count=0)
Uwe.Schneider

1 Answer


You can repeat any of the allowed_words_between, each preceded by whitespace chars and optionally followed by a comma, until you reach one of the end_signs.

Turn the capture groups into non-capturing groups (?:...), or else re.findall will return the capture group values instead of the whole match.
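To illustrate the difference, here is a minimal sketch on a shortened pattern. The sample text and the simplified patterns are made up for this illustration and are not the final pattern.

```python
import re

text = "a $-millionaire with a Ferrari"

# With capturing groups, findall returns a tuple of the group values.
capturing = re.findall(r"(\$)\S* with a (Ferrari|BMW)", text)

# With non-capturing groups, findall returns the whole match.
non_capturing = re.findall(r"(?:\$)\S* with a (?:Ferrari|BMW)", text)

print(capturing)      # [('$', 'Ferrari')]
print(non_capturing)  # ['$-millionaire with a Ferrari']
```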

Note that you have to escape the dollar sign as \$ to match it literally; an unescaped $ is an end-of-string anchor.
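If you would rather keep the original, unescaped lists from the question, re.escape can do the escaping for you. This is a small sketch; the variable names escaped and start_alternation are only illustrative, and sorting longest-first is an extra precaution that this particular pattern does not strictly need.

```python
import re

start_signs = ["$", "$$"]  # plain strings, as in the question

# Escape regex metacharacters; try the longer sign "$$" before "$".
escaped = [re.escape(s) for s in sorted(start_signs, key=len, reverse=True)]
start_alternation = "(?:" + "|".join(escaped) + ")"

print(start_alternation)  # (?:\$\$|\$)
```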

The pattern will look like

(?:\$|\$\$)\S*(?:(?:\s+(?:and|with|a|very|beautiful),?)*\s+(?:Ferrari|BMW|Lamborghini|ship))+

The pattern matches

  • (?:\$|\$\$)\S* Match any of the start_signs followed by optional non whitespace chars (\S can also match a dollar sign, but you can make that more specific like -\w+)
  • (?: Outer non capture group
    • (?: Inner non capture group
      • \s+(?:and|with|a|very|beautiful),? Match any of the allowed_words_between optionally followed by a comma
    • )*\s+ Close inner non capture group and repeat 0+ times followed by 1+ whitespace chars
    • (?:Ferrari|BMW|Lamborghini|ship) Match any of the end_signs
  • )+ Close outer non capture group and repeat 1+ times to also match the string with Lamborghini and a Ferrari
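The breakdown above can also be kept next to the pattern itself by using re.VERBOSE, which ignores whitespace and allows comments inside the pattern. A sketch of the same pattern in that form, with two sample sentences taken from the question:

```python
import re

pattern = re.compile(r"""
    (?:\$|\$\$)\S*                                 # start sign and the rest of that word
    (?:                                            # outer group
        (?:\s+(?:and|with|a|very|beautiful),?)*    # allowed words, optional comma
        \s+(?:Ferrari|BMW|Lamborghini|ship)        # one of the end signs
    )+                                             # repeat for Lamborghini and a Ferrari
""", re.VERBOSE)

m = pattern.search("I would like to be a $$-millionair with a Lamborghini and a Ferrari.")
print(m.group())  # $$-millionair with a Lamborghini and a Ferrari

# A word outside the allowed list blocks the match.
no_match = pattern.search("I would like to be a $$-millionair with a rotten Lamborghini.")
print(no_match)  # None
```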


import re

allowed_words_between = ["and", "with", "a", "very", "beautiful"]
start_signs = [r"\$", r"\$\$"]  # raw strings, so the backslash reaches the regex engine unchanged
end_signs = ["Ferrari", "BMW", "Lamborghini", "ship"]
teststring = """
             I would like to be a $-millionaire with a Ferrari.
             I would like to be a $$-millionair with a Lamborghini.
             I would like to be a $$-millionair with a rotten Lamborghini.
             I would like to be a $$-millionair with a Lamborghini and a Ferrari.
             I would like to be a $-millionaire with a very, very beautiful ship!
             I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship.
             I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great.
             """
regexString = r"(?:" + "|".join(start_signs) + r")\S*(?:(?:\s+(?:" + "|".join(allowed_words_between) + r"),?)*\s+(?:" + "|".join(end_signs) + r"))+"

for s in re.findall(regexString, teststring):
    print(s)

Output

$-millionaire with a Ferrari
$$-millionair with a Lamborghini
$$-millionair with a Lamborghini and a Ferrari
$-millionaire with a very, very beautiful ship
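The question also uses re.sub, so here is a possible way to wrap the same idea in a reusable function. This is only a sketch: build_pattern and alt are hypothetical helper names, and the trailing \b is an addition that is not part of the pattern above. It is there so that an end sign such as ship does not match inside a longer word like shipping.

```python
import re

def build_pattern(start_signs, allowed_words, end_signs):
    """Hypothetical helper: compile the pattern from plain, unescaped lists."""
    def alt(items):
        # Escape each item; longest first so "$$" is tried before "$".
        return "|".join(re.escape(i) for i in sorted(items, key=len, reverse=True))
    return re.compile(
        rf"(?:{alt(start_signs)})\S*"
        rf"(?:(?:\s+(?:{alt(allowed_words)}),?)*"
        rf"\s+(?:{alt(end_signs)})\b)+"
    )

pattern = build_pattern(
    ["$", "$$"],
    ["and", "with", "a", "very", "beautiful"],
    ["Ferrari", "BMW", "Lamborghini", "ship"],
)

sentence = "I would like to be a $-millionaire with a Ferrari."
print(pattern.findall(sentence))              # ['$-millionaire with a Ferrari']
print(pattern.sub("[[Found It]]", sentence))  # I would like to be a [[Found It]].
```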
The fourth bird