How to get nested-groups with regexp

Question

I need your help with the following regex. I have this text:

"[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."

Using a regex, I want to get:

[Hello|Hi]
[inviting | calling]
[[junior| mid junior]|senior]

The following regex, (\[[^\[$\]\]]*\]),

gives me [Hello|Hi] [inviting | calling] [junior| mid junior]

So how should I fix it to get the correct output?

The re module doesn't support regex recursion, which is needed for this kind of task. You might want to take a look at https://pypi.python.org/pypi/regex — Sebastian Proske, Oct 28 '16 at 06:42
Most implementations of regular expressions aren't up to the task of parsing nested expressions: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la PCRE is an extension to regular expressions, which is why the PCRE "regex" solution below looks nothing like the regular expression grammar you're used to. — LinuxDisciple, Oct 28 '16 at 07:07
The solution you accepted will work only for a fixed nesting depth; it's not a generic solution. — vks, Oct 28 '16 at 07:13

John1024 · Accepted Answer · 2016-10-28T07:23:08.177

Let's define your string and import re:

>>> s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."
>>> import re

Now, try:

>>> re.findall(r'\[ (?:[^][]* \[ [^][]* \])* [^][]*  \]', s, re.X)
['[Hello|Hi]', '[inviting | calling]', '[[junior| mid junior]|senior]']

In more detail

Consider this script:

$ cat script.py
import re
s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."

matches = re.findall(r'''\[       # Opening bracket
        (?:[^][]* \[ [^][]* \])*  # Zero or more non-bracket characters followed by a [, followed by zero or more non-bracket characters, followed by a ]
        [^][]*                    # Zero or more non-bracket characters
        \]                        # Closing bracket
        ''',
        s,
        re.X)
print('\n'.join(matches))

This produces the output:

$ python script.py
[Hello|Hi]
[inviting | calling]
[[junior| mid junior]|senior]
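
If a larger but still fixed nesting depth is needed, the same idea can be generated instead of written out by hand. The following is only a sketch: bracket_pattern is a hypothetical helper name, not something provided by re, and it assumes the maximum depth is known in advance. Each pass wraps the previous pattern in one more pair of brackets.

```python
import re

def bracket_pattern(max_depth):
    """Build a regex for bracket groups nested up to max_depth levels.

    Hypothetical helper for illustration; max_depth must be >= 1.
    """
    # Depth 1: a bracket pair with no brackets inside it.
    pattern = r'\[[^][]*\]'
    for _ in range(max_depth - 1):
        # Allow either a single non-bracket character or a complete
        # group of the previous depth between the outer brackets.
        pattern = r'\[(?:[^][]|' + pattern + r')*\]'
    return pattern

s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."
print(re.findall(bracket_pattern(2), s))
```

With max_depth=2 this finds the same three groups as the pattern above. Input nested more deeply than max_depth is still not matched as a whole, so this does not help with arbitrarily deep nesting.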

The OP is asking about nested brackets; as soon as you add a third level, this won't work anymore. — Sebastian Proske, Oct 28 '16 at 06:50
The extension to three levels is obvious. If he were to need arbitrarily deep nesting, that would be an issue. The OP may want to clarify. — John1024, Oct 28 '16 at 06:59

score 2 · Answer 2 · answered Oct 28 '16 at 07:09

You can use a simple stack to do this instead of a recursive regex:

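x = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer.[sd[sd[sd][sd]]]"
l = []        # completed top-level groups
st = []       # stack of currently open '[' characters
start = None  # index where the current top-level group began
for i, j in enumerate(x):
    if j == '[':
        if not st:            # stack empty: a new top-level group starts here
            start = i
        st.append(j)
    elif j == ']' and st:     # skip a stray ']' when nothing is open
        st.pop()
        if not st:            # stack empty again: the top-level group is closed
            l.append(x[start:i+1])
print(l)

Output: ['[Hello|Hi]', '[inviting | calling]', '[[junior| mid junior]|senior]', '[sd[sd[sd][sd]]]']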

score 1 · Answer 3 · answered Oct 28 '16 at 06:57

You can use the PyPI regex module with a PCRE-like recursive pattern, r'\[(?:[^][]++|(?R))*]':

>>> import regex
>>> s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."
>>> r = regex.compile(r'\[(?:[^][]++|(?R))*]')
>>> print(r.findall(s))
['[Hello|Hi]', '[inviting | calling]', '[[junior| mid junior]|senior]']


The pattern \[(?:[^][]++|(?R))*] matches a [, then zero or more occurrences of either one or more characters other than [ and ] (matched possessively by ++) or, via (?R), the whole pattern again (that is, a nested [...] group), and finally a closing ].
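
To make the recursion concrete, here is a rough standard-library sketch of the same grammar written as a recursive function. The names match_group and find_groups are hypothetical and used only for illustration; this is not how the regex module works internally. The recursive call plays the role of (?R), and the plain character step plays the role of [^][].

```python
def match_group(text, pos):
    """Return the index just past the group starting at text[pos], or -1."""
    if pos >= len(text) or text[pos] != '[':
        return -1
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ']':
            return pos + 1                 # closing bracket: group complete
        if ch == '[':
            end = match_group(text, pos)   # nested group, like (?R)
            if end == -1:
                return -1
            pos = end
        else:
            pos += 1                       # ordinary character, like [^][]
    return -1                              # text ended before the closing ]

def find_groups(text):
    """Collect non-overlapping balanced groups, scanning left to right."""
    groups = []
    pos = 0
    while pos < len(text):
        end = match_group(text, pos) if text[pos] == '[' else -1
        if end == -1:
            pos += 1                       # no group starts here, move on
        else:
            groups.append(text[pos:end])
            pos = end
    return groups

s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."
print(find_groups(s))
```

Because the function calls itself once per nesting level, extremely deep nesting could hit Python's recursion limit; for such input an explicit stack or counter is the safer choice.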
