I have a text like this:

; Robert ( #Any kind of character here# ) #Any kind of character here#; 
John ( #Any kind of character here# )

To check whether the text ends with Robert(...) or with John(...) using regular expressions in Python, I use something like this:

if re.search(r'[;\s+]Robert\s*[(].*[)]\s*$', text, re.DOTALL) is not None:
    # Some code here
elif re.search(r'[;\s+]John\s*[(].*[)]\s*$', text, re.DOTALL) is not None:
    # Some code here

The problem is that there could be anything inside the parentheses (even more pairs of opening and closing parentheses), so I used the dot with the DOTALL option. As a result, the match runs up to the last parenthesis and finds 'Robert(...)' every time, although the right answer is 'John(...)'.

So, how can I solve this problem and make it stop at the correct parenthesis to find 'John'?

martineau
Jaime_mc2
  • Please provide an example where your regex fails. – cs95 Jul 16 '17 at 15:56
    So as far as I can see this is more or less a duplicate of https://stackoverflow.com/q/5454322/4153464. TL;DR regex simply isn't built for this, it is for [regular languages](https://en.wikipedia.org/wiki/Regular_language), you're looking for a more complete parser. – Work of Artiz Jul 16 '17 at 15:56
  • To me it sounds like you're looking for the **lazy dot star**, i.e. `.*?` – Jan Jul 16 '17 at 15:58
  • Instead of dot-star for `Any kind of character here`, match *anything but a right paren* `[^)]*` - `;\s*(Robert|John)\s?(\([^)]*\))`. Then use re.finditer (or .findall) and use the last match found. – wwii Jul 16 '17 at 16:47
  • Do you use an online regex tester to play around with patterns? If not you should. – wwii Jul 16 '17 at 16:51
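
The `[^)]*` plus `re.finditer` suggestion from the comments can be sketched roughly as follows. This is only an illustration on the sample text from the question; the variable names are made up here, and the pattern stops at the first `)` it meets, so it is only expected to behave when the parentheses are not nested.

```python
import re

# Sample text from the question (no nested parentheses).
text = """; Robert ( #Any kind of character here# ) #Any kind of character here#; 
John ( #Any kind of character here# )"""

# Pattern taken from the comment: match anything but a right paren inside (...).
pattern = re.compile(r";\s*(Robert|John)\s?(\([^)]*\))")

# Collect every match and keep the name captured by the last one.
matches = list(pattern.finditer(text))
last_name = matches[-1].group(1) if matches else None
print(last_name)
```

Note that nothing here anchors the last match to the end of the text; that check would have to be added separately.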

2 Answers

The re module doesn't have a feature to deal with nested brackets; however, the third-party regex module has a recursion feature (and more):

import regex

s='''; Robert ( #Any kind of character here# ) #Any kind of character here#; 
John ( #Any kind of character here# )'''

pat = r'(?r);\s*(Robert|John)\s*\(([^()]*+(?:\((?2)\)[^()]*)*+)\)\s*$'

m = regex.search(pat, s)

if m is not None:
    if m.group(1) == 'Robert':
        print('Robby')
    else:
        print('Johnny')

pattern details:

(?r)  # reverse search modifier: search from the end of the string
;\s*  #
(Robert|John) \s* # capture group 1
\(
(    # capture group 2
    [^()]*+ # all that isn't a bracket
    (?:
        \( (?2) \) # recursion with the capture group 2 subpattern
        [^()]*
    )*+
)
\) \s* $
Casimir et Hippolyte

DISCLAIMER, this post 'works' but should NEVER be used

So first of all, as I commented earlier, regex isn't meant to be recursive; you may need a module like pyparsing if you want to solve this cleanly.
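
For a sense of what a non-regex approach can look like, here is a minimal hand-written sketch that counts parenthesis depth from the end of the text. It is not pyparsing and not a full parser; the function name and the rule that the name must be preceded by `;` or whitespace are assumptions made for this illustration.

```python
def name_before_final_parens(text, names=("Robert", "John")):
    """Return which of `names` precedes the final balanced (...) group, else None.

    Hypothetical helper for illustration only.
    """
    stripped = text.rstrip()
    if not stripped.endswith(")"):
        return None
    depth = 0
    # Walk backwards to find the "(" that balances the final ")".
    for i in range(len(stripped) - 1, -1, -1):
        ch = stripped[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
            if depth == 0:
                head = stripped[:i].rstrip()
                for name in names:
                    if head.endswith(name):
                        before = head[:-len(name)]
                        # Assumed rule: name starts the text or follows ";" / whitespace.
                        if not before or before[-1] == ";" or before[-1].isspace():
                            return name
                return None
    return None  # parentheses are unbalanced


sample = "; Robert ( #x# ) #y#; \nJohn ( a (b) (c(d)) )"
print(name_before_final_parens(sample))
```

Because it only tracks a depth counter, this handles any nesting depth and sibling groups, but it knows nothing about parentheses inside quoted strings or comments.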

If you still desperately want to shoot yourself in the foot and use regex for something it wasn't intended to do, you can make use of the regex module, a technique Casimir beautifully explained with a fully working recursive regex. I wouldn't recommend doing it this way, but I can't judge your current position.

But hey, why shoot yourself in the foot when you can take the entire leg with it? By only using the built-in re module of course :D So without further delays, here's to making an unmaintainable mess and keeping your job indefinitely until they fully rewrite whatever you're making:

import re

n = 25 # level of nesting allowed, must be specified due to python regex not being recursive
parensre = r"\([^()]*" + r"(?:\([^()]*" * n + r"[^()]*\))?" * n + r"[^()]*\)"

robertre = re.compile(r"Robert\s*" + parensre, re.M | re.S)
johnre   = re.compile(r"John\s*" + parensre, re.M | re.S)

tests = """
  Robert (Iwant(to(**doRegexMyWay(hithere) * 8) / 3) + 1) ; John (whatever())
John(I dont want to anymore())
"""

print(robertre.findall(tests))  # outputs ['Robert (Iwant(to(**doRegexMyWay(hithere) * 8) / 3) + 1)']
print(johnre.findall(tests))    # outputs ['John (whatever())', 'John(I dont want to anymore())']

You can of course mix and combine the parts, with parensre being the cornerstone brick of your already collapsing sandcastle. The trick is to create n (set to 25 here) optional non-capturing groups, all nested inside each other, with a single group being structured like ( non-brackets non-capturing-group non-brackets ).
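
To see the construction at a smaller scale, here is a sketch that wraps the same string concatenation in a helper (the name `build_parens_re` is made up for this illustration) and probes it with a few inputs. It also shows a limitation that follows from the structure: each level allows only one nested group, so sibling groups such as `(a(b)(c))` are not matched as a whole.

```python
import re


def build_parens_re(n):
    # Same construction as parensre above, with the nesting depth as a parameter.
    return r"\([^()]*" + r"(?:\([^()]*" * n + r"[^()]*\))?" * n + r"[^()]*\)"


# With n = 1 the generated pattern is short enough to read.
print(build_parens_re(1))

pat = re.compile(build_parens_re(3))
print(pat.fullmatch("(a(b(c)))") is not None)   # linear nesting within the limit
print(pat.fullmatch("(a(b)(c))") is not None)   # two sibling groups at one level
print(pat.fullmatch("((((()))))") is not None)  # five levels, more than n + 1
```

The first check succeeds and the other two fail, which is worth keeping in mind before relying on this pattern for arbitrary input.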

A taste of the regex it generates:

\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*(?:\([^()]*[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\))?[^()]*\)

TL;DR please don't ever try to do this with re

Work of Artiz