How to create a regular expression that splits a string based on certain patterns and stops extracting if it detects another of the starting patterns?

Question

The regular expression must extract text when it detects one of the 4 patterns, but it must stop extracting as soon as it detects, later in the text, the start of another of those 4 patterns.

For example, it extracts "en la " and everything that follows, unless it then finds, for example, a "junto con ".

These are the 4 pseudo-regex patterns; the parenthesized alternatives are the start words that activate each one.

where pattern:

(en la|en el|en ) {SIDE} hay que {sense}
hay que {sense} (en la|en el) {SIDE}

what_time pattern:

(a las|a la) {HOURS} hay que {sense}
hay que {sense} (a las|a la) {HOURS}

with_whom_or_with_what pattern:

(para ellos|para el|para ellas|para ella|junto a |junto con ) {PERSON} hay que {sense}
hay que {sense} (para) {PERSON}

why pattern:

(por) {MATTER_OF_WHY} hay que {sense}
hay que {sense} (por) {MATTER_OF_WHY}

Below are some examples of how the regex covering the 4 cases (plus the sense case, so 5 patterns in total) should work: each extraction ends as soon as any of the other patterns starts. Note that the substring extracted by each pattern must be stored in the variable corresponding to that pattern type.

import re

#at the beginning the 5 variables start with empty strings
sense, where, what_time, with_whom_or_with_what, why = "", "", "", "", ""

input_text = "En la montaña de aquel frio lugar hay que estar preparados para largas noches frías junto a tus compañeros"

# the correct output...
sense = "hay que estar preparados para largas noches frías"
where = "En la montaña de aquel frio lugar"
what_time = ""
with_whom_or_with_what = "junto a tus compañeros"
why = ""

input_text = "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías"

# the correct output...
sense = "hay que estar preparados para largas noches frías"
where = "en la montaña de aquel frio lugar"
what_time = ""
with_whom_or_with_what = "junto a tus compañeros"
why = ""

input_text = "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías porque puede ser peligroso y ocurrir accidentes"

# the correct output...
sense = "hay que estar preparados para largas noches frías"
where = "en la montaña de aquel frio lugar"
what_time = ""
with_whom_or_with_what = "junto a tus compañeros"
why = "porque puede ser peligroso y ocurrir accidentes"

I hope you can help me. I have had a lot of trouble making more than 3 patterns (here, 5 patterns) limit each other so that each one extracts only what corresponds to it.

I tried this regex:

import re

#at the beginning the 5 variables start with empty strings
sense, where, what_time, with_whom_or_with_what, why = "", "", "", "", ""

input_text = "En la montaña de aquel frio lugar hay que estar preparados para largas noches frías junto a tus compañeros"
#input_text = "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías"
#input_text = "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías porque puede ser peligroso y ocurrir accidentes"

regex_pattern = r"((?P<sense>(hay que)\s.+?)|(?P<where>(en la|en el|en)\s.+?)|(?P<what_time>(a las|a la)\s.+?)|(?P<with>(para ellos|para el|para ellas|para ella|junto a|junto con)\s.+?)|(?P<why>(por|porque)\s.+?))(?=\b(en la|en el|en|hay que|a las|a la|para ellos|para el|para ellas|para ella|junto a|junto con|por|porque|$|[!\.\?]))"
n = re.search(regex_pattern, input_text, re.IGNORECASE)
if(n):
    group_list = n.groups()
    print(group_list)

It is not giving the correct output:

('En la montaña de aquel frio lugar ', None, None, 'En la montaña de aquel frio lugar ', 'En la', None, None, None, None, None, None, 'hay que')

Also, since the words can appear in the input in any order, there would be no way to know which element of the list belongs to which search pattern.

I apologize, my Spanish is quite bad. `para` is listed as a pattern for `person`, but also used in the `sense` example. Is there a way to tell them apart? — mcky, Jul 23 '22 at 00:47
@mcky I had not considered that, but it can be solved by requiring a personal pronoun in the **person** pattern: instead of just the word **para**, I should have written **para ellos**, **para ello**, **para ellas**, **para ella**. That is, for the **person** pattern the word **para** must be followed by a pronoun; otherwise the text should stay in the **sense** pattern. I have already edited the pattern in the question. — , Jul 23 '22 at 00:58

mcky · Accepted Answer · 2022-07-23T05:11:42.213

Here is the pattern I came up with. It's based on an expansion of this thread. Access the extracted text through the named groups: sense, where, what_time, with, why.

Test here.

Breakdown:

First half:

((?P<sense>(hay que)\s.+?)|(?P<where>(en la|en el|en)\s.+?)|(?P<what_time>(a las|a la)\s.+?)|(?P<with>(para ellos|para el|para ellas|para ella|junto a|junto con)\s.+?)|(?P<why>(por|porque)\s.+?))

Each type has the form (?P<name>(qualifier)\s.+?), which matches the qualifier first and then any text after it. These alternatives all sit in a single capture group, separated by |.

(?P<name>...) is a named group for python

\s for a single whitespace character

. for any character (this SHOULD be any text character [\w\x{00C0}-\x{017F}], but I'm not sure how to do the Unicode accented characters in Python).

+? matches the previous token (.) at least once, as few times as possible (lazy modifier)
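The tokens listed above can be tried in isolation. The following is a minimal sketch, assuming Python 3 and `str` patterns; the shortened patterns and sample sentences are made up for illustration and are not the full regex from this answer. It also shows that, in Python 3, `\w` already matches accented letters unless the `re.ASCII` flag is set.

```python
import re

# Named groups: groupdict() maps every group name to its text (None if unused).
m = re.search(r"(?P<where>en la\s\w+)|(?P<why>porque\s\w+)", "en la montaña")
named = m.groupdict()
print(named)  # {'where': 'en la montaña', 'why': None}

# Lazy vs. greedy: .+? stops at the first position where the rest can match,
# while .+ takes as much as possible and here runs to the end of the string.
text = "hay que estar listos porque hace frío"
lazy = re.search(r"hay que\s.+?(?=\bporque|$)", text).group()
greedy = re.search(r"hay que\s.+(?=\bporque|$)", text).group()
print(repr(lazy))    # 'hay que estar listos '
print(repr(greedy))  # 'hay que estar listos porque hace frío'

# Unicode: for str patterns \w covers accented letters by default.
unicode_words = re.findall(r"\w+", "montaña frías")
# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", "montaña frías", flags=re.ASCII)
print(unicode_words)  # ['montaña', 'frías']
print(ascii_words)    # ['monta', 'a', 'fr', 'as']
```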

Second half:

After all type capture groups, there is a positive lookahead which finds the qualifier for the next type and uses it as the ending for the current phrase.

(?=\b(en la|en el|en|hay que|a las|a la|para ellos|para el|para ellas|para ella|junto a|junto con|por|porque|\s$|$|[!\.\?]))

(?=...) Positive lookahead (it is matched without consuming characters)

\b for word boundary. This verifies that we are not matching our pattern words when they are actually parts of other words (para => par, para largas => a la)

The remaining alternatives are patterns on which each group can end (notice the $ end-of-line anchor and the !.? punctuation).
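The effect of the lookahead and of \b can be checked on small inputs. This is only a sketch with simplified patterns and an invented sample sentence, not the full regex from this answer.

```python
import re

text = "para largas noches a la montaña"

# Without \b, "a la" is also found inside "par[a la]rgas".
no_boundary = [m.start() for m in re.finditer(r"a la", text)]
# With \b, only the standalone "a la" qualifies.
with_boundary = [m.start() for m in re.finditer(r"\ba la", text)]
print(no_boundary)    # [3, 19]
print(with_boundary)  # [19]

# The lookahead checks for the next qualifier but does not consume it,
# so the next search can start exactly at "hay que".
sentence = "en la montaña hay que estar"
m = re.search(r"en la\s.+?(?=\bhay que)", sentence)
matched = m.group()
remainder = sentence[m.end():]
print(repr(matched))    # 'en la montaña '
print(repr(remainder))  # 'hay que estar'
```

Because the qualifier is left unconsumed, re.finditer can pick it up as the start of the next phrase, which is what lets the groups limit each other.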

Usage Example:

import re

regex = r"((?P<sense>(hay que)\s.+?)|(?P<where>(en la|en el|en)\s.+?)|(?P<what_time>(a las|a la)\s.+?)|(?P<with>(para ellos|para el|para ellas|para ella|junto a|junto con)\s.+?)|(?P<why>(por|porque)\s.+?))(?=\b(en la|en el|en|hay que|a las|a la|para ellos|para el|para ellas|para ella|junto a|junto con|por|porque|\s$|$|[!\.\?]))"

test_str = ("Junto a tus compañeros en la montaña de aquel frio lugar hay que estar preparados para largas noches frías porque puede ser peligroso y ocurrir accidentes")

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for match in matches:
    # groupdict() holds only the named groups; groups that did not take part are None
    for group_name, value in match.groupdict().items():
        if value is not None:
            print(group_name + ": " + value)

Output:

with: Junto a tus compañeros 
where: en la montaña de aquel frio lugar 
sense: hay que estar preparados para largas noches frías 
why: porque puede ser peligroso y ocurrir accidentes
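To store each substring in its own variable, as the question asks, the matches can be collected into a dictionary keyed by group name. The sketch below is one possible way to do it: `extract_fields` is a hypothetical helper name, the regex is the one from this answer split across lines only for readability, and keeping the first occurrence of each type while stripping the trailing space are assumptions about the desired behavior.

```python
import re

# Same regex as in the usage example, concatenated from several raw strings.
REGEX = (
    r"((?P<sense>(hay que)\s.+?)"
    r"|(?P<where>(en la|en el|en)\s.+?)"
    r"|(?P<what_time>(a las|a la)\s.+?)"
    r"|(?P<with>(para ellos|para el|para ellas|para ella|junto a|junto con)\s.+?)"
    r"|(?P<why>(por|porque)\s.+?))"
    r"(?=\b(en la|en el|en|hay que|a las|a la|para ellos|para el|para ellas"
    r"|para ella|junto a|junto con|por|porque|\s$|$|[!\.\?]))"
)


def extract_fields(text):
    """Return one entry per pattern type; types that never match stay ''."""
    fields = {"sense": "", "where": "", "what_time": "", "with": "", "why": ""}
    for match in re.finditer(REGEX, text, re.MULTILINE | re.IGNORECASE):
        for name, value in match.groupdict().items():
            # Assumption: keep only the first phrase found for each type.
            if value is not None and not fields[name]:
                fields[name] = value.strip()
    return fields


fields = extract_fields(
    "Junto a tus compañeros en la montaña de aquel frio lugar hay que estar "
    "preparados para largas noches frías porque puede ser peligroso y ocurrir accidentes"
)
sense = fields["sense"]
where = fields["where"]
what_time = fields["what_time"]
with_whom_or_with_what = fields["with"]
why = fields["why"]
print(fields)
```

Since the lookup is by group name, the order in which the phrases appear in the input does not matter. The original capitalization is preserved, so here with_whom_or_with_what is "Junto a tus compañeros" with a capital J.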

I was trying to use that regex pattern you passed me but it doesn't work for me. I have updated the question, adding the code that includes your regex although it is not working well. Also it is difficult to assign each substring to the correct output variable. — , Jul 23 '22 at 02:35
