
I am parsing some input text and need to flag every character outside a recognized set of permissible characters as illegal, except when those characters occur within a pair of parentheses. Effectively, parentheses should protect illegal characters from being caught.

Among the SO search results, the only similar question I found was "Find nth character except if its enclosed in brackets php", but I am not sure how to adapt that to my case.

For example, how would I construct a regex to flag all non-alphabetic (say [^a-z]) characters except when they occur within parentheses (obviously the parentheses would themselves be legal)?

jamadagni
  • an example would be better. – Avinash Raj Jul 06 '14 at 14:58
  • A regex-only solution is possible (using Python regexes) iff parentheses can never be nested and are always correctly balanced. Is that the case for your input? – Tim Pietzcker Jul 06 '14 at 14:59
  • @TimPietzcker You may already know this, but using the `regex` module instead of `re`, we can also match nested parentheses as it supports recursion. :) – zx81 Jul 06 '14 at 15:31

1 Answer


Let's work with your example:

how to construct a regex to flag all non-alphabetic (say [^a-z]) characters except when they occur within parentheses

This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

We can look at two options, depending on whether or not parentheses can be nested.

Option 1: No Nesting

We can use this simple regex:

\([^)]*\)|([^a-z()]+)

The left side of the alternation | matches complete (parentheses). We will ignore these matches. The right side matches and captures the offending characters to Group 1, and we know they are the right ones because they were not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

import re

subject = '[]{}&&& ThisIs(OK)'
regex = re.compile(r'\([^)]*\)|([^a-z()]+)')

# findall returns Group 1 for every match; it is an empty string when the
# left side (a parenthesized span) matched, so keep only non-empty captures
matches = [group for group in regex.findall(subject) if group]

print("\n*** Matches ***")
if matches:
    for match in matches:
        print(match)
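
If you also want to know where each offending run sits, or just need a yes/no answer, the same regex can be wrapped as in the sketch below. The names `find_illegal` and `has_illegal` are made up for illustration, and the sketch assumes un-nested parentheses as in Option 1.

```python
import re

# Same pattern as above: the left side swallows "(...)" spans,
# the right side captures illegal runs into Group 1.
ILLEGAL = re.compile(r'\([^)]*\)|([^a-z()]+)')

def find_illegal(text):
    """Return (start_index, run) for each illegal run outside parentheses."""
    # group(1) is None when the left side (a parenthesized span) matched
    return [(m.start(1), m.group(1)) for m in ILLEGAL.finditer(text) if m.group(1)]

def has_illegal(text):
    """Return True if any illegal character occurs outside parentheses."""
    return any(m.group(1) for m in ILLEGAL.finditer(text))

print(find_illegal('[]{}&&& ThisIs(OK)'))
print(has_illegal('abc(123)def'))
```

Note that a plain `re.match` or `re.search` on this pattern is not enough for a boolean test, because a match of the left side alone (a parenthesized span) also counts as a match. You have to check that Group 1 actually captured something.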

Option 2: Nested Parentheses

If for any reason parentheses can be nested, use Matthew Barnett's regex module for Python, substituting this recursive regex on the left side of the | to match the parentheses: \((?:[^()]++|(?R))*\). The overall regex therefore becomes:

\((?:[^()]++|(?R))*\)|([^a-z()]+)
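
If installing the third-party regex module is not an option, the nested case can also be handled without any regex by tracking parenthesis depth. To be clear, this is a stand-in and not the recursive pattern above: a minimal sketch that assumes balanced parentheses, with a hypothetical function name and `legal` parameter.

```python
import string

def illegal_runs_nested(text, legal=string.ascii_lowercase):
    """Collect runs of illegal characters that sit outside all parentheses.

    Assumes parentheses are balanced; a stray ')' at depth 0 is ignored.
    """
    runs, current, depth = [], [], 0
    for ch in text:
        is_paren = ch in '()'
        if depth == 0 and not is_paren and ch not in legal:
            current.append(ch)              # illegal and unprotected
        elif current:
            runs.append(''.join(current))   # a legal char or a paren ends the run
            current = []
        if ch == '(':
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
    if current:
        runs.append(''.join(current))
    return runs

print(illegal_runs_nested('ab(1(2)3)cd4 (x)5'))
```

It can also serve as a cross-check for the recursive regex on your own test strings.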


zx81
  • Hello zx81 -- I guess I'll be accepting your answer in a while. I read your references (esp the long answer you gave in your first one). I had initially wanted to just test `if re.match(illegalCatcher, string)` and was wondering if it could be done somehow using look(?:ahead|behind)s (!) -- would you say that it is *impossible* to do so even using the newer `regex` module or is it just that it would be *impractical*? – jamadagni Jul 08 '14 at 03:34
  • Glad this helps, jamadagni. :) If you're wanting to match these illegal chars **except** in a certain context, then this is by far the most reliable solution. Of course you can use a lookahead to check that the illegal char is not followed by non-pars + a closing par: `(?![^)]*\))`, but that does not guarantee there was an opening par on the left of your illegal char... Also there's a hack to check that the number of pars to the right of the illegal character is even... But same flaw. No approach is perfect all of the time, you pick! :) – zx81 Jul 08 '14 at 03:46
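
To make the flaw described in the last comment concrete, here is a small sketch comparing the lookahead variant with the alternation from the answer. The exact lookahead pattern is an assumption built from the fragment in the comment (`[^a-z()]` followed by a negative lookahead for non-closing-parens plus a closing paren), and the function names are hypothetical.

```python
import re

# Lookahead variant: an illegal char NOT followed by "...)" is flagged.
LOOKAHEAD = re.compile(r'[^a-z()](?![^)]*\))')
# Alternation from the answer: skip "(...)" spans, capture the rest.
ALTERNATION = re.compile(r'\([^)]*\)|([^a-z()]+)')

def flagged_by_lookahead(text):
    return LOOKAHEAD.findall(text)

def flagged_by_alternation(text):
    return [g for g in ALTERNATION.findall(text) if g]

# The ')' here has no opening '(' before it, so '1' is not really protected.
sample = 'ab1) cd2'
print(flagged_by_lookahead(sample))    # misses the '1'
print(flagged_by_alternation(sample))  # catches the '1'
```

On well-formed input such as 'ab(1)cd' both approaches agree and flag nothing. They only diverge when a closing parenthesis has no opening partner to its left.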