I am trying to extract sentences from a large set of text that contain words from a list.
For example, searching for "noodl", "vege" and "meat":
str1 = "My new noodles are great\n vegetables. Not \nthis noodle sentence though.\n Nor this vege sentences."
results = re.findall(regex, str1)
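This should return "My new noodles are great\n vegetables." as the only match, since it is the only sentence that contains at least two of the search words.
Based on the question "Python extracting sentence containing 2 words", I was able to come up with the following regex: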
regex = re.compile(
    r"""
    (                           # Whole sentence
        [^.]*?                  # Starting with anything but '.'
        (                       # Capture group start
            (noodl|vege|meat)   # Contains one of these words
            [^.]*               # With anything but '.' in between
        ){2,}                   # At least 2 times
        [^.]*\.                 # Followed by anything but '.', then a '.'
    )
    """,
    re.MULTILINE | re.IGNORECASE | re.VERBOSE)
But this results in
for x in results:
    print(x)
#My new noodles are great\n vegetables.
#vegetables
#vege
This is unexpected. How should I change my regex so that it matches only the whole sentences? The sentences found are processed further. The actual text is not in English, but the results are the same as with these demo sentences.
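One direction that might be worth checking, sketched under the assumption that the extra results come from the capture groups rather than from the matching itself: when a pattern contains groups, `re.findall` returns the contents of the groups instead of the whole match. The variant below is only an illustration of that idea; it keeps the same pattern but turns every group into a non-capturing `(?:...)` group, and it has only been reasoned through against the demo string, not the real (non-English) data.

```python
import re

# Same pattern as in the question, but every group is non-capturing,
# so re.findall returns the full match (the sentence) only.
regex = re.compile(
    r"""
    [^.]*?                      # anything but '.'
    (?:                         # non-capturing group
        (?:noodl|vege|meat)     # one of the search words
        [^.]*                   # anything but '.' in between
    ){2,}                       # at least 2 times
    [^.]*\.                     # rest of the sentence, up to and including '.'
    """,
    re.IGNORECASE | re.VERBOSE)

str1 = ("My new noodles are great\n vegetables. Not \nthis noodle sentence "
        "though.\n Nor this vege sentences.")

results = regex.findall(str1)
for x in results:
    print(repr(x))
# 'My new noodles are great\n vegetables.'
```

`re.MULTILINE` is left out here because the pattern has no `^` or `$` anchors, so the flag would not change anything. An alternative that keeps the original groups would be `re.finditer` with `match.group(0)`.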