
I have a series of conditional expressions as strings, like: "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))", which are nested to an unpredictable depth. I have no control over this data; I need to be able to work with it.

I'd like Python to be able to understand these structures so that I can manipulate them.

I've tried creating nested Python lists by breaking at the parentheses (using pyparsing or unutbu's/falsetru's parser):

>>> str1="('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
>>> var1=parse_nested(str1, r'\s*\(', r'\)\s*')
>>> var1
[["'a' AND", ["'b' OR 'c'"]], 'OR', ["'d' AND", ["'e' OR", ["'g' AND 'h'"]]]]

...But then, to be honest, I don't know how to get Python to interpret the AND/OR relationships between the objects. I feel like I'm going in completely the wrong direction with this.

My ideal output would be a data structure that maintains the relationship types between the entities in a nested way, so that (for example) it would be easy to create JSON or YAML, as in the example and sketch below:

OR:
- AND:
  - a
  - OR:
    - b
    - c
- AND:
  - d
  - OR:
    - e
    - AND:
      - g
      - h
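
In Python terms, that YAML corresponds to nested dicts and lists. As a sketch of the structure I'm after (assuming PyYAML for the serialization; the variable name is just for illustration):

import yaml  # assumes PyYAML is installed

# Operator names as keys, lists of operands as values.
target = {
    'OR': [
        {'AND': ['a', {'OR': ['b', 'c']}]},
        {'AND': ['d', {'OR': ['e', {'AND': ['g', 'h']}]}]},
    ]
}

# Dumping this reproduces the YAML shown above.
print(yaml.dump(target, sort_keys=False, default_flow_style=False))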
AWaL
  • You need to specify the structure of the expressions much more clearly. What *exactly* can appear between those single quotes? Are double quotes ever used instead of single quotes? How are embedded quotes escaped? Can the quoted terms ever contain embedded `( ) AND OR`? Are the terms *always* quoted, or are other forms possible? – ekhumoro Sep 17 '22 at 20:35
  • Sure: yes, the terms are always quoted (there may be extensions when I bring in other data sets in future where numbers are not quoted, but for now this will do). The single quotes can contain any string - the only rule is that any single or double quotes will be escaped; outside of that, they could contain any character set at all. – AWaL Sep 18 '22 at 09:49
  • But *exactly how* is the escaping done? More generally, are the quoted terms syntactically valid python strings? If so, the escapes would have to be doubled: e.g. `"('\\'a\\'' OR 'b')"`. – ekhumoro Sep 18 '22 at 11:43
  • Oh I see. Yes, they are as you describe. – AWaL Sep 18 '22 at 15:21
  • I'm going to recommend the "Lark" module. – Erotemic Sep 18 '22 at 16:51
  • @AWaL There was a small bug in my example code that [I've now fixed](https://stackoverflow.com/posts/73764684/revisions). It might have been hard to spot if you weren't aware of it, so I thought I'd better let you know. – ekhumoro Sep 19 '22 at 11:21

3 Answers


As far as I know there isn't a builtin for this. I would just loop over the string and check for the separators; then you can split the pieces apart and strip the spaces.
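
A minimal sketch of that idea (the function name is mine; note it assumes the quoted terms never contain spaces, parentheses or escaped quotes, which the comments on the question say is not guaranteed):

def parse_simple(text):
    # Push a new list at every '(' and pop back to the parent at every ')';
    # anything between separators becomes a token.
    root = current = []
    stack = []
    token = ''
    for ch in text:
        if ch == '(':
            stack.append(current)
            current.append(new := [])
            current = new
        elif ch in (')', ' '):
            if token:
                current.append(token.strip("'"))
                token = ''
            if ch == ')':
                current = stack.pop()
        else:
            token += ch
    if token:
        current.append(token.strip("'"))
    return root

print(parse_simple("('a' AND ('b' OR 'c'))"))
# [['a', 'AND', ['b', 'OR', 'c']]]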

2Bored

Assuming the quoted terms are syntactically valid Python strings, it should be possible to use the tokenize module to parse the input. The only additional requirement is that all embedded escapes are doubled. Also note that the string tokens returned by the tokeniser are always quoted, so they can be safely eval'd.

The demo script below provides a basic implementation. The output is a list of lists, with AND/OR converted to singleton types - but if you need a different structure, it should be simple enough to adapt the code accordingly:

import tokenize, token as tokens
from io import StringIO

class ParseError(Exception):
    def __init__(self, message, tokens):
        super().__init__(f'{message}: {tokens}')

AND = type('AND', (object,), {'__repr__': lambda x: 'AND'})()
OR = type('OR', (object,), {'__repr__': lambda x: 'OR'})()

NAMES = {'AND': AND, 'OR': OR, 'TRUE': True, 'FALSE': False}
OPS = {tokens.LPAR, tokens.RPAR}
OTHER = {
    tokens.ENDMARKER, tokens.INDENT, tokens.DEDENT,
    tokens.NEWLINE, tokens.NL, tokens.COMMENT,
    }

def parse_expression(string):
    string = StringIO(string)
    result = current = []
    stack = [current]
    for token in tokenize.generate_tokens(string.readline):
        if (kind := token.type) == tokens.OP:
            if token.exact_type == tokens.LPAR:
                # push the parent before descending into the new nested list
                stack.append(current)
                current.append(current := [])
            elif token.exact_type == tokens.RPAR:
                current = stack.pop()
            else:
                raise ParseError('invalid operator', token)
        elif kind == tokens.NAME:
            if (name := token.string.upper()) in NAMES:
                current.append(NAMES[name])
            else:
                raise ParseError('invalid name', token)
        elif kind == tokens.STRING or kind == tokens.NUMBER:
            current.append(eval(token.string))
        elif kind not in OTHER:
            raise ParseError('invalid token', token)
    return result


str1 = "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
str2 = "('(AND)' AND (42 OR True)) or ('\\'d\\'' AND ('e' OR ('g' AND 'h')))"

print(parse_expression(str1))
print(parse_expression(str2))

Output:

[['a', AND, ['b', OR, 'c']], OR, ['d', AND, ['e', OR, ['g', AND, 'h']]]]
[['(AND)', AND, [42, OR, True]], OR, ["'d'", AND, ['e', OR, ['g', AND, 'h']]]]
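
If you need the nested-dict structure from the question, the list-of-lists output can be folded into it afterwards. A minimal sketch (to_tree is a hypothetical helper, and it assumes each parenthesized group sticks to a single operator):

def to_tree(group):
    # Operands sit at even indices, the AND/OR singletons at odd indices.
    if not isinstance(group, list):
        return group
    operands = [to_tree(item) for item in group[::2]]
    operators = {repr(op) for op in group[1::2]}
    if not operators:
        return operands[0]
    if len(operators) > 1:
        raise ValueError(f'mixed operators in group: {group}')
    return {operators.pop(): operands}

print(to_tree(parse_expression(str1)))
# {'OR': [{'AND': ['a', {'OR': ['b', 'c']}]},
#         {'AND': ['d', {'OR': ['e', {'AND': ['g', 'h']}]}]}]}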
ekhumoro

Lark has what you need:

https://github.com/lark-parser/lark

import lark
SHAPE_GRAMMAR = (
    r'''
    ?start: expr

    expr                 : VARIABLE | bin_expr | expr (WS BIN_OP WS expr)* | "(" WS* expr WS* ")"

    // A binary expression
    bin_expr  : VARIABLE WS BIN_OP WS VARIABLE

    // The valid binary operators
    BIN_OP :  "AND" | "OR" | "XOR" | "NOR" | "NAND"

    VARIABLE : ESCAPED_SQ_STRING

    // Can  also use lark directives like `%import common.WS`
    _STRING_INNER: /.*?/
    _STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/
    ESCAPED_SQ_STRING : "'" _STRING_ESC_INNER "'"
    WS: /[ \t\f\r\n]/+
    ''')

parser = lark.Lark(SHAPE_GRAMMAR, start='start', parser='earley')
print(parser.parse("'A'").pretty())
print(parser.parse("'A' OR 'c'").pretty())
print(parser.parse("'A' OR 'B' OR 'c'").pretty())
print(parser.parse("('A' OR 'B' )").pretty())
print(parser.parse("('A' OR 'B' OR 'C' )").pretty())
print(parser.parse("('A' OR ('B' OR 'C') )").pretty())

output = parser.parse("('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))")
print(output.pretty())

Look at the tutorials for how to work with the tree structure using things like NodeVisitors / Transformers.
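
For example, here is a minimal Transformer sketch (the class and method names are mine) that folds the tree above into nested dicts, assuming each group sticks to a single operator and stripping the quotes naively:

from lark import Transformer, Token

class ToDict(Transformer):
    def _fold(self, children):
        # Drop the WS tokens the grammar keeps in the tree, then split the
        # rest into operands (even indices) and operators (odd indices).
        items = [c for c in children
                 if not (isinstance(c, Token) and c.type == 'WS')]
        if len(items) == 1:
            return items[0]
        operands = items[::2]
        operators = {str(op) for op in items[1::2]}
        if len(operators) > 1:
            raise ValueError('mixed operators in one group')
        return {operators.pop(): operands}
    expr = _fold
    bin_expr = _fold
    def VARIABLE(self, tok):
        return str(tok)[1:-1]  # naively strip the surrounding quotes

print(ToDict().transform(output))
# {'OR': [{'AND': ['a', {'OR': ['b', 'c']}]},
#         {'AND': ['d', {'OR': ['e', {'AND': ['g', 'h']}]}]}]}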

The pretty output will give you an idea of the tree structure, but I think the networkx forest_str function (which will hopefully soon be deprecated in favour of write_network_text) displays it more clearly.

Here is a little bit of code I wrote to convert the above tree into a networkx graph, which I like using:

import networkx as nx  # NOQA


def lark_dfs_edges(self):
    id_counter = 0
    from lark.tree import Tree
    stack = [(None, self)]
    while stack:
        parent, node = stack.pop()
        if parent is None:
            parent_id = None
        else:
            parent_id = id(parent)
        if isinstance(node, Tree):
            node_id = id(node)
            yield node, parent_id, node_id, repr(node.data)
            for child in reversed(node.children):
                stack.append((node, child))
        else:
            # hack made a unique id for this leaf
            leaf_id = id_counter
            id_counter += 1
            yield node, parent_id, leaf_id, str(node)


def lark_to_networkx(self):
    tree = nx.DiGraph()
    for node, parent_id, node_id, label in lark_dfs_edges(self):
        tree.add_node(node_id, label=label, node=node)
        if parent_id is not None:
            tree.add_edge(parent_id, node_id)
    return tree

tree = lark_to_networkx(output)
print(nx.forest_str(tree))

That outputs something that I think makes slightly more sense:

╙── Token('RULE', 'expr')
    ├─╼ Token('RULE', 'expr')
    │   └─╼ Token('RULE', 'expr')
    │       ├─╼ Token('RULE', 'expr')
    │       │   └─╼ 'a'
    │       ├─╼  
    │       ├─╼ AND
    │       ├─╼  
    │       └─╼ Token('RULE', 'expr')
    │           └─╼ Token('RULE', 'expr')
    │               └─╼ Token('RULE', 'bin_expr')
    │                   ├─╼ 'b'
    │                   ├─╼  
    │                   ├─╼ OR
    │                   ├─╼  
    │                   └─╼ 'c'
    ├─╼  
    ├─╼ OR
    ├─╼  
    └─╼ Token('RULE', 'expr')
        └─╼ Token('RULE', 'expr')
            ├─╼ Token('RULE', 'expr')
            │   └─╼ 'd'
            ├─╼  
            ├─╼ AND
            ├─╼  
            └─╼ Token('RULE', 'expr')
                └─╼ Token('RULE', 'expr')
                    ├─╼ Token('RULE', 'expr')
                    │   └─╼ 'e'
                    ├─╼  
                    ├─╼ OR
                    ├─╼  
                    └─╼ Token('RULE', 'expr')
                        └─╼ Token('RULE', 'expr')
                            └─╼ Token('RULE', 'bin_expr')
                                ├─╼ 'g'
                                ├─╼  
                                ├─╼ AND
                                ├─╼  
                                └─╼ 'h'
Erotemic
  • This is really cool - gives me a lot to think about. I enjoy graph data structures; I'm going to play around with this - thanks! – AWaL Sep 19 '22 at 09:14
  • Lark is what I wish pyparsing was. It's incredibly easy to use. The input actually looks like a grammar and not code. There is also a lark-cython optional module that helps speed it up. I've been very happy with it when writing small DSLs. – Erotemic Sep 19 '22 at 20:25