3

I am trying to workout a regular expression that can be used to parse logical expression with brackets

For example:

((weight gt 10) OR (weight lt 100)) AND (length lt 50)

I hope it can be parsed into:

Group 1: (weight gt 10) OR (weight lt 100)
Group 2: AND
Group 3: length lt 50

And if the order of this changes:

(length lt 50) AND ((weight gt 10) OR (weight lt 100))

I hope it can be parsed into:

Group 1: length lt 50
Group 2: AND
Group 3: (weight gt 10) OR (weight lt 100)

The costest I tried is this expression:

(\((?>[^()]+|(?1))*\))

The problem of this is that it only worked partially:

((weight gt 10) OR (weight lt 100)) AND (length lt 50)

Group 1: ((weight gt 10) OR (weight lt 100))
Group 2: (length lt 50)

(length lt 50) AND ((weight gt 10) OR (weight lt 100))

Group 1: (length lt 50)
Group 2: ((weight gt 10) OR (weight lt 100))

The logcal operator is not selected as a group.

How can I fix this to catch the logical operator AND?

ZZZ
  • 645
  • 4
  • 17
  • 1
    You'll need a proper parser instead. Also, please tag your programming language. Another way could be to use a recursive approach and test the bricks for remaining parentheses, see https://regex101.com/r/lxp0Yq/1/ – Jan Apr 30 '21 at 20:06
  • How about `(weight gt 10) OR (weight lt 100)`? How do you expect that to come out? – Scratte Apr 30 '21 at 20:43

4 Answers4

5

With your shown samples, please try following regex, tested and written in Python3.8

^(?:\((\(weight.*?\))|\((length[^)]*))\)\s+(AND)\s+(?:\((\(weight.*?\).*?\))|\((length[^)]*)\))$

OR Generic solution:

^(?:\((\(.*?\))|\((\w+[^)]*))\)\s+(\S+)\s+(?:\((\(\w+.*?\).*?\))|\((\w+[^)]*)\))$

Here is the complete code of python3: Following results are with sample specific regex, just changed regex to Generic one(shown above) and it should work for generic values too.

import re
##Scenario 1st here...
var="""((weight gt 10) OR (weight lt 100)) AND (length lt 50)"""
li = re.findall(r'^(?:\((\(weight.*?\))|\((length[^)]*))\)\s+(AND)\s+(?:\((\(weight.*?\).*?\))|\((length[^)]*)\))',var)
[('(weight gt 10) OR (weight lt 100)', '', 'AND', '', 'length lt 50')]

##Scenario 2nd here.
var="""(length lt 50) AND ((weight gt 10) OR (weight lt 100))
li = re.findall(r'^(?:\((\(weight.*?\))|\((length[^)]*))\)\s+(AND)\s+(?:\((\(weight.*?\).*?\))|\((length[^)]*)\))',var)
[('', 'length lt 50', 'AND', '(weight gt 10) OR (weight lt 100)', '')]

##Remove null elements in 1st scenario's find command here.
[string for string in li[0] if string != ""]
['(weight gt 10) OR (weight lt 100)', 'AND', 'length lt 50']

##Remove null elements came in 2nd scenario's find command here.
[string for string in li[0] if string != ""]
['length lt 50', 'AND', '(weight gt 10) OR (weight lt 100)']

Explanation: Adding detailed explanation for above regex.

^                                          ##Checking from starting of value.
(?:                                        ##Creating a non-capturing group here.
  \(                                       ##Matching literal \ here.
  (\(weight.*?\))|\((length[^)]*))\        ##Creating 1st capturing group to match weight till ) OR length before ) as per need.
)                                          ##Closing 1st non-capturing group here.
\s+                                        ##Matching 1 or more occurrences of spaces here.
(AND)                                      ##Matching AND and keeping it in 2nd capturing group here.
\s+                                        ##Matching 1 or more occurrences of spaces here.
(?:                                        ##Creating 2nd capturing group here.
  \(                                       ##Matching literal \ here.
  (\(weight.*?\).*?\))|\((length[^)]*)\)   ##Creating 3rd capturing group here which is matching either weight till ) 2nd occurrence OR length just before ) as per need.
)$                                         ##Closing 2nd non-capturing group at end of value here.
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
  • Thanks for your comment. I forgot to mentioned that the 'weight' and 'length' are just examples. It can be any words actually. – ZZZ Apr 30 '21 at 21:38
  • @ZZZ, sure, I have changed it to more generic solution now, tested it with samples and it looked fine to me, let me know how it goes, thank you. – RavinderSingh13 Apr 30 '21 at 21:45
  • Thanks for the updated answer. It seems working. However, for ((weight gt 10) OR (weight lt 100)) AND (length lt 50), the result group is 1,3,5 rather than 1,2,3. For (length lt 50) AND ((weight gt 10) OR (weight lt 100)), the result group is 2,3,4 rather than 1,2,3. – ZZZ Apr 30 '21 at 23:07
  • @ZZZ, yeah, that's why I had removed null groups of you see last 2 codes, then they will come on correct place. Once you remove null groups you could use them with 1,2,3 etc group numbers, thank you. – RavinderSingh13 Apr 30 '21 at 23:24
5

You were almost there. The only missing bit was that the logical expression isn't being contained in parenthesis. The AND and OR that you want to capture. Your regex expression requires everything to sit in the middle of parenthesis.

Also what you call groups seems to actually be matches, where

((weight gt 10) OR (weight lt 100)) AND (length lt 50)

is matched twice:

  • First match is ((weight gt 10) OR (weight lt 100))
  • Second match is (length lt 50)

There are only two groups in your expression and they are identical, since group1 (g1), the outer most parenthesis, really is the entire expression (g0).

Since your expression matches any contained logic, I just expanded it a bit, adding a enclosing optional non-caputing group, that consists of capturing groups that you presented:

(?:([^()]+)((?1)))?

Combined it becomes

(\((?>[^()]+|(?1))*\))(?:([^()]+)((?1)))?
^----------- g1 -----^    ^-g2--^^-g3-^

The (?1) still references the first group as in your original expression. All the below are matches and their respective groups:

(weight gt 10)
^--- g1 -----^

(weight gt 10) OR (weight lt 100)
^--- g1 -----^ g2 ^--- g3 ------^

((weight gt 10) OR (weight lt 100)) AND (length lt 50)
^-------------- g1 ---------------^  g2 ^--- g3 -----^

(length lt 50) AND ((weight gt 10) OR (weight lt 100))
^--- g1 -----^  g2 ^------------- g3 ----------------^

(length lt 50) nonsense ((weight gt 10) OR (weight lt 100))
^--- g1 -----^    g2    ^-------------- g3 ---------------^

The character glass only excludes parenthesis, so any nonsense is matched.


Your expression broken down:

(            # capturing group 1
  \(         # match a `(` literally
  (?>        # atomic/independent, non-capturing group (meaning no backtracking into the group)
    [^()]    # any character that is not `(` nor `)` 
    +        # one or more times
   |         # or
    (?1)     # recurse group 1.
             # ..this is like a copy of the expression of group 1 here.
             # ..which also includes this part.
             # ..so it's sort of self-recursing
  )*         # zero or more times
  \)         # match a `)` literally
)

The addition broken down:

(?:          # non-capturing group     
  (          # capturing group 2
    [^()]    # any character that is not `(` nor `)` 
    +        # one or more times
  )
  (          # capturing group 3
    (?1)     # recurse group 1.
  )
)?           # zero or one time

The expression at regex101. Here I changed the character class to [^()\n] to avoid issues with newlines.

Scratte
  • 3,056
  • 6
  • 19
  • 26
  • Thanks for your reply. I tried (\((?>[^()]+|(?1))*\))(?:([^()]+)((?1)))? in Java. However, it complains the number 1 inside the regex. This regex is too complicated for me to understand. Can you suggest how I can change the 1 in Java? – ZZZ Apr 30 '21 at 22:52
  • @ZZZ You cannot use this expression in Java. That includes your original expression. Java does not have recursive groups, which is the the `(?1)` used. Groups are counted by their beginning parenthesis which in this case are not part of the `(?>…)` nor the `(?1)`. The very first `(` is also the first group, and the `(?1)` means to recurse it, meaning "repeat the expression of group 1". – Scratte Apr 30 '21 at 22:59
  • @ZZZ If you're interested, you can learn more about recursive groups from RexEgg's [Recursive Regular Expressions](http://rexegg.com/regex-recursion.html). Note that `(?R)` is the same as `(?0)` which means the entire expression. – Scratte Apr 30 '21 at 23:29
3

Regex is the wrong tool for the job, you need to analyze each character and yield the content when parentheses open or close. In Python this could be (shamelessly copied and further modified from here):

def parenthetic_contents(string):
    """Generate parenthesized contents in string as pairs (level, contents)."""
    stack = []
    buffer = ""
    for i, c in enumerate(string):
        if c == '(':
            if buffer:
                yield (0, buffer.strip())
            buffer = ""
            stack.append(i)
        elif c == ')' and stack:
            start = stack.pop()
            yield (len(stack), string[start + 1: i])
        elif c not in ['(', ')'] and not stack:
            buffer += c


for part in parenthetic_contents("(length lt 50) AND ((weight gt 10) OR (weight lt 100))"):
    print(part)

Which would yield

(0, 'length lt 50')
(0, 'AND')
(1, 'weight gt 10')
(1, 'weight lt 100')
(0, '(weight gt 10) OR (weight lt 100)')
Jan
  • 42,290
  • 8
  • 54
  • 79
1

You might use your expression and add an inner group to get the balanced parenthesis part inside the outer parenthesis.

It does not give group 1, 2 and 3 but it captures the balanced inner part in group 2 , and the word like AND in group 3 while repeating the outer group 1.

The \G anchor (if supported) will get consecutive matches when there are more occurences of AND for example.

(?:(\(((?>[^()]+|(?1))*)\))|\G(?!^)(?:\s*(\w+)\s*)(?=\())
  • (?: Non capture group
    • ( Capture group 1
      • \( Match (
      • ( Capture group 2 (like length lt 50)
        • (?>[^()]+|(?1))* Atomic group, recurse the first subpattern
      • ) Close group 2
      • \) Match )
    • ) Close group 1
    • | Or
    • \G(?!^) Assert the position at the end of previous match (Which started at the start of the string matching the opening parenthesis)
    • (?:\s*(\w+)\s*)(?=\() Capture 1+ word chars in group 3 (like AND) only when there is a ( to the right
  • ) Close non capture group

Regex demo

The fourth bird
  • 154,723
  • 16
  • 55
  • 70