The regular expression must start extracting when it detects one of the 4 patterns, and must stop as soon as the text after that initial pattern contains the start of another of those 4 patterns.
For example, it extracts "en la " and everything that follows, unless it then finds, for example, "junto con ".
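To make the stopping rule concrete, here is a minimal sketch of the idea for a single pattern. The `stop_words` list and the sample text are made up for illustration; the only point is that a lazy match followed by a lookahead ends the capture right before the next starter phrase.

```python
import re

# Hypothetical, simplified illustration: capture from "en la " up to (but not
# including) the start of another pattern such as "junto con ", or end of text.
stop_words = r"(?:junto con|junto a|hay que)"
pattern = re.compile(rf"(en la\s.+?)(?=\s+{stop_words}\b|$)", re.IGNORECASE)

text = "en la montaña junto con tus amigos"
match = pattern.search(text)
where_part = match.group(1) if match else ""
print(where_part)  # en la montaña
```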
These are the 4 pseudo-regex patterns, showing the starting phrase that activates each one:
where pattern:
(en la|en el|en ) {SIDE} hay que {sense}
hay que {sense} (en la|en el) {SIDE}
what_time pattern:
(a las|a la) {HOURS} hay que {sense}
hay que {sense} (a las|a la) {HOURS}
with_whom_or_with_what pattern:
(para ellos|para el|para ellas|para ella|junto a |junto con ) {PERSON} hay que {sense}
hay que {sense} (para) {PERSON}
why pattern:
(por) {MATTER_OF_WHY} hay que {sense}
hay que {sense} (por) {MATTER_OF_WHY}
Below are some examples of how a regex covering the 4 cases (plus the sense case, so 5 patterns in total) should work, ending each extraction as soon as any of the other patterns starts. Note that the substring extracted by each pattern must be stored in the variable corresponding to that pattern type.
import re
#at the beginning the 5 variables start with empty strings
sense, where, what_time, with_whom_or_with_what, why = "", "", "", "", ""
input_text = "En la montaña de aquel frio lugar hay que estar preparados para largas noches frías junto a tus compañeros"
# the correct output...
sense = "hay que estar preparados para largas noches frías"
where = "En la montaña de aquel frio lugar"
what_time = ""
with_whom_or_with_what = "junto a tus compañeros"
why = ""
input_text = "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías"
# the correct output...
sense = "hay que estar preparados para largas noches frías"
where = "en la montaña de aquel frio lugar"
what_time = ""
with_whom_or_with_what = "junto a tus compañeros"
why = ""
input_text = "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías porque puede ser peligroso y ocurrir accidentes"
# the correct output...
sense = "hay que estar preparados para largas noches frías"
where = "en la montaña de aquel frio lugar"
what_time = ""
with_whom_or_with_what = "junto a tus compañeros"
why = "porque puede ser peligroso y ocurrir accidentes"
I hope you can help me. I have been having trouble getting more than 3 patterns (here, these 5) to limit each other so that each one extracts only what corresponds to it.
I tried this regex:
import re
#at the beginning the 5 variables start with empty strings
sense, where, what_time, with_whom_or_with_what, why = "", "", "", "", ""
input_text = "En la montaña de aquel frio lugar hay que estar preparados para largas noches frías junto a tus compañeros"
#input_text = "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías"
#input_text = "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías porque puede ser peligroso y ocurrir accidentes"
regex_pattern = r"((?P<sense>(hay que)\s.+?)|(?P<where>(en la|en el|en)\s.+?)|(?P<what_time>(a las|a la)\s.+?)|(?P<with>(para ellos|para el|para ellas|para ella|junto a|junto con)\s.+?)|(?P<why>(por|porque)\s.+?))(?=\b(en la|en el|en|hay que|a las|a la|para ellos|para el|para ellas|para ella|junto a|junto con|por|porque|$|[!\.\?]))"
n = re.search(regex_pattern, input_text, re.IGNORECASE)
if n:
    group_list = n.groups()
    print(group_list)
It is not giving the correct output:
('En la montaña de aquel frio lugar ', None, None, 'En la montaña de aquel frio lugar ', 'En la', None, None, None, None, None, None, 'hay que')
Also, since the phrases can appear in the input in any order, there is no way to tell which element of the tuple belongs to which search pattern.
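One direction that might address both problems is sketched below. It is an illustration under assumptions, not a verified solution: `re.finditer` collects every segment instead of only the first, and `Match.lastgroup` reports which named group matched, so the input order no longer matters. The `STARTERS` table and the `extract` helper are hypothetical names, and the starter phrases are only loosely based on the patterns above.

```python
import re

# Hypothetical starter phrases per category (an assumption for this sketch).
STARTERS = {
    "sense": r"hay que",
    "where": r"en la|en el|en",
    "what_time": r"a las|a la",
    "with_whom_or_with_what":
        r"para ellos|para ellas|para ella|para el|junto a|junto con",
    "why": r"porque|por",
}

# A segment ends right before any starter phrase, before ., ! or ?, or at the end.
any_starter = "|".join(STARTERS.values())
stop = rf"(?=\s+(?:{any_starter})\b|\s*[.!?]|\s*$)"

# One named group per category; inner groups are non-capturing so that
# Match.lastgroup always names the category that matched.
branches = "|".join(
    rf"(?P<{name}>\b(?:{alts})\s.+?)" for name, alts in STARTERS.items()
)
PATTERN = re.compile(rf"(?:{branches}){stop}", re.IGNORECASE)


def extract(text):
    """Return a dict mapping each category to its extracted substring."""
    result = {name: "" for name in STARTERS}
    for m in PATTERN.finditer(text):
        result[m.lastgroup] = m.group(m.lastgroup)
    return result


input_text = ("Junto a tus compañeros en la montaña de aquel frio lugar "
              "hay que estar preparados para largas noches frías "
              "porque puede ser peligroso y ocurrir accidentes")
for name, value in extract(input_text).items():
    print(f"{name} = {value!r}")
```

A known limitation of this sketch: any starter phrase appearing inside a segment as a whole word (for example a bare "en" or "por" within the sense text) will cut that segment short, so the starter lists would need tuning for real input.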