I need help with the following regex. I have this text:

"[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."

Using a regex, I want to get:

[Hello|Hi]
[inviting | calling]
[[junior| mid junior]|senior]

The following regex

(\[[^\[$\]\]]*\])

gives me [Hello|Hi], [inviting | calling] and [junior| mid junior] instead,

so how should I fix it to get the correct output?

  • The re module doesn't support regex recursion, which is needed for this kind of task. You might want to take a look at https://pypi.python.org/pypi/regex – Sebastian Proske Oct 28 '16 at 06:42
  • Most implementations of regular expressions aren't up to the task of parsing nested expressions: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la. PCRE is an extension to regular expressions, which is why the PCRE "regex" solution below looks nothing like the regular expression grammar you're used to. – LinuxDisciple Oct 28 '16 at 07:07
  • The solution you accepted will work only for 3 levels. It's not a generic solution. – vks Oct 28 '16 at 07:13

3 Answers

Let's define your string and import re:

>>> s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."
>>> import re

Now, try:

>>> re.findall(r'\[ (?:[^][]* \[ [^][]* \])* [^][]*  \]', s, re.X)
['[Hello|Hi]', '[inviting | calling]', '[[junior| mid junior]|senior]']

In more detail

Consider this script:

$ cat script.py
import re
s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."

matches = re.findall(r'''\[       # Opening bracket
        (?:[^][]* \[ [^][]* \])*  # Zero or more times: non-bracket characters, then one nested [...] that contains no brackets itself
        [^][]*                    # Zero or more non-bracket characters
        \]                        # Closing bracket
        ''',
        s,
        re.X)
print('\n'.join(matches))

This produces the output:

$ python script.py
[Hello|Hi]
[inviting | calling]
[[junior| mid junior]|senior]
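
One caveat, as a hedged illustration: the pattern spells out the nesting by hand, so it only recognises a bracket group that contains non-nested bracket groups. The sketch below uses the same pattern written without the re.X whitespace; the names ONE_LEVEL, two_levels and three_levels are made up for this example.

```python
import re

# Same pattern as above, without the re.X whitespace and comments.
ONE_LEVEL = re.compile(r'\[(?:[^][]*\[[^][]*\])*[^][]*\]')

two_levels = "[[junior| mid junior]|senior]"
three_levels = "[a[b[c]]]"  # hypothetical input, one level deeper

print(ONE_LEVEL.findall(two_levels))    # ['[[junior| mid junior]|senior]']
print(ONE_LEVEL.findall(three_levels))  # ['[b[c]]'], the outermost group is missed
```

For input nested more deeply than the example in the question, the pattern would have to be extended by hand, or replaced by a stack-based or recursive approach.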
John1024

You can use a simple stack to do this instead of a recursive regex:

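x = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer.[sd[sd[sd][sd]]]"
groups = []   # completed top-level bracket groups
stack = []    # holds one '[' for every bracket that is currently open
start = None  # index of the outermost open '['
for i, ch in enumerate(x):
    if ch == '[':
        if not stack:          # entering a new top-level group
            start = i
        stack.append(ch)
    elif ch == ']' and stack:  # skip a stray ']' instead of raising IndexError
        stack.pop()
        if not stack:          # the outermost bracket has just closed
            groups.append(x[start:i + 1])
print(groups)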

Output: ['[Hello|Hi]', '[inviting | calling]', '[[junior| mid junior]|senior]', '[sd[sd[sd][sd]]]']
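
Since the stack only ever holds '[' characters, a plain depth counter carries the same information. The following is a minimal sketch of that variant, wrapped in a generator so it can be reused; the name top_level_groups is made up for this example, and ignoring stray ']' characters is an assumption about how malformed input should be treated.

```python
def top_level_groups(text):
    """Yield each outermost [...] group in text, at any nesting depth."""
    depth = 0      # number of brackets currently open
    start = None   # index of the outermost open '['
    for i, ch in enumerate(text):
        if ch == '[':
            if depth == 0:
                start = i
            depth += 1
        elif ch == ']' and depth > 0:  # a stray ']' is ignored (assumption)
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]


s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."
print(list(top_level_groups(s)))
```

A group that is never closed yields nothing here, which differs from the regex approaches: those would still report any complete inner group.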

vks

You may use the following code with the PyPI regex module and a PCRE-like r'\[(?:[^][]++|(?R))*]' regex:

>>> import regex
>>> s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."
>>> r = regex.compile(r'\[(?:[^][]++|(?R))*]')
>>> print(r.findall(s))
['[Hello|Hi]', '[inviting | calling]', '[[junior| mid junior]|senior]']

See the regex demo.

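The \[(?:[^][]++|(?R))*] pattern matches a [, then zero or more repetitions of either 1+ chars other than ] and [ (matched possessively, so they are never given back) or the whole pattern again via (?R), which consumes a nested [...] expression, and then a closing ].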
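
To make the recursion easier to follow, here is a rough pure-Python sketch that walks the string the way the pattern is described above. It is only an illustration of the idea, not how the regex module works internally, and the helper names match_group and find_groups are made up for this example.

```python
def match_group(text, pos):
    # Try to match one balanced [...] group starting at pos.
    # Returns the index just past the closing ']', or None on failure.
    if pos >= len(text) or text[pos] != '[':
        return None
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == ']':                    # the final closing bracket
            return i + 1
        if ch == '[':                    # the (?R) branch: recurse into a nested group
            end = match_group(text, i)
            if end is None:
                return None
            i = end
        else:                            # the [^][]++ branch: an ordinary character
            i += 1
    return None                          # ran out of text before the group closed


def find_groups(text):
    # Scan left to right and collect non-overlapping matches, like findall.
    results = []
    i = 0
    while i < len(text):
        end = match_group(text, i) if text[i] == '[' else None
        if end is None:
            i += 1
        else:
            results.append(text[i:end])
            i = end
    return results


s = "[Hello|Hi]. We are [inviting | calling] you at position [[junior| mid junior]|senior] developer."
print(find_groups(s))
```

Because the nested case calls the same function again, the depth of nesting is not fixed in advance, which is the property the (?R) construct provides.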

Wiktor Stribiżew