I'm trying to use pyparsing==2.4.7 to parse search queries that have a field:value format.

Examples of the strings I want to parse include:

field1:value1
field1:value1 field2:value2
field1:value1 AND field2:value2
(field1:value1a OR field1:value1b) field2:value2
(field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)

A few things to note:

  • OR and | both mean "OR"; likewise, AND and & both mean "AND"
  • If there is no boolean operator between conditions, then an AND is implied
  • Queries can be nested hierarchically with parentheses
  • The values (on the right side of the :) will never have spaces

I have written a parser that works (code is based on this SO answer), but only when every operator (AND or OR) is written out explicitly:

import pyparsing as pp
from pyparsing import Word, alphanums, Literal, oneOf

field_name = Word(alphanums).setResultsName('field_name')

search_value = Word(alphanums + '-').setResultsName('search_value')

operator = Literal(':')

query = field_name + operator + search_value

AND = oneOf(['AND', 'and', '&', ' '])
OR = oneOf(['OR', 'or', '|'])
NOT = oneOf(['NOT', 'not', '!'])

query_expr = pp.infixNotation(query, [
    (NOT, 1, pp.opAssoc.RIGHT, ),
    (AND, 2, pp.opAssoc.LEFT, ),
    (OR, 2, pp.opAssoc.LEFT, ),
])

class ComparisonExpr:
    def __init__(self, tokens):
        self.tokens = tokens
    def __str__(self):
        return "Comparison:('field': {!r}, 'operator': {!r}, 'value': {!r})".format(*self.tokens)
    def __repr__(self):
        return self.__str__()

query.addParseAction(ComparisonExpr)

sample = "(field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)"

result = query_expr.parseString(sample).asList()

from pprint import pprint
pprint(result)

[[[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'),
   '|',
   Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')],
  '&',
  [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'),
   '|',
   Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]]

However, if I try it with a sample that is missing an operator, the parser appears to stop at the point where an operator would be expected:

sample = "(field1:value1a | field1:value1b) (field2:value2a | field2:value2b)"

result = query_expr.parseString(sample).asList()
from pprint import pprint
pprint(result)

[[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'),
  '|',
  Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')]]

Is there a way to make whitespace an "implicit AND" if there is no operator separating terms?

Nathan Jones

1 Answer

Short answer:

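By default, pyparsing skips whitespace before matching each expression, so the ' ' alternative in your oneOf never gets a chance to match. Instead, let AND match nothing at all when no explicit operator is present. Replace your definition of AND with: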

AND = oneOf(['AND', 'and', '&']) | pp.Empty()

Some other suggestions:

For easier post-parse processing, you may want the Empty() to actually emit a "&" operator. You can do that with a parse action:

AND = oneOf(['AND', 'and', '&']) | pp.Empty().addParseAction(lambda: "&")

In fact, you can normalize all of your operators to just "&", "|", and "!", so that later code never needs checks like "if operator == 'AND' or operator == 'and' or ...". To do that, put the parse action on the whole expression:

AND = (oneOf(['AND', 'and', '&']) | pp.Empty()).addParseAction(lambda: "&")
OR = oneOf(['OR', 'or', '|']).addParseAction(lambda: "|")
NOT = oneOf(['NOT', 'not', '!']).addParseAction(lambda: "!")

Also, considering that you are now accepting "" as equivalent to "&", you should make pyparsing treat your operators like keywords, so that "oregon" is not misread as "or egon". Add the asKeyword argument to all your oneOf expressions:

AND = (oneOf(['AND', 'and', '&'], asKeyword=True)
       | pp.Empty()).addParseAction(lambda: "&")
OR = oneOf(['OR', 'or', '|'], asKeyword=True).addParseAction(lambda: "|")
NOT = oneOf(['NOT', 'not', '!'], asKeyword=True).addParseAction(lambda: "!")
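
As a rough sketch of how these pieces might fit together in one script (not the exact code from the question): the term is collapsed into a single string with Combine purely to keep the output short, parse_query is a hypothetical helper name, and the word operators are spelled out with pp.Keyword while the symbols use pp.Literal, which is a variation on the asKeyword=True form above. parseAll=True is an addition here, so that leftover input raises a ParseException instead of being silently ignored.

```python
import pyparsing as pp

# One "field:value" term, collapsed to a single string such as "field1:value1".
# (Assumption for brevity; the question wraps terms in a ComparisonExpr class.)
term = pp.Combine(pp.Word(pp.alphanums) + ":" + pp.Word(pp.alphanums + "-"))

# Word operators are Keywords so they only match as whole words;
# symbol operators are plain Literals. Each spelling is normalized to one symbol.
AND = (pp.Keyword("AND") | pp.Keyword("and") | pp.Literal("&")
       | pp.Empty()).setParseAction(lambda: "&")  # Empty() acts as the implicit AND
OR = (pp.Keyword("OR") | pp.Keyword("or") | pp.Literal("|")).setParseAction(lambda: "|")
NOT = (pp.Keyword("NOT") | pp.Keyword("not") | pp.Literal("!")).setParseAction(lambda: "!")

query_expr = pp.infixNotation(term, [
    (NOT, 1, pp.opAssoc.RIGHT),
    (AND, 2, pp.opAssoc.LEFT),
    (OR, 2, pp.opAssoc.LEFT),
])


def parse_query(text):
    """Hypothetical helper: parse the whole string or raise ParseException."""
    # parseAll=True makes pyparsing fail on leftover input rather than
    # stopping part-way through the string without an error.
    return query_expr.parseString(text, parseAll=True).asList()


explicit = parse_query("(field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)")
implicit = parse_query("(field1:value1a | field1:value1b) (field2:value2a | field2:value2b)")
print(implicit)
```

With this grammar, the explicit and implicit forms should produce the same nested list, and a field name such as "oregon" stays intact because "or" only matches as a whole word.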

Lastly, when you want to write test strings, you don't need to loop over them or catch ParseExceptions yourself - just use runTests:

query_expr.runTests("""\
    (field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)
    (field1:value1a | field1:value1b) (field2:value2a | field2:value2b)
    """)

This prints each test string, followed by the parsed results, or by the parse exception with a '^' marking where it occurred:

(field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)
[[[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')], '&', [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]]
[0]:
  [[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')], '&', [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]
  [0]:
    [Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')]
  [1]:
    &
  [2]:
    [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]

(field1:value1a | field1:value1b) (field2:value2a | field2:value2b)
[[[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')], '&', [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]]
[0]:
  [[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')], '&', [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]
  [0]:
    [Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')]
  [1]:
    &
  [2]:
    [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]
PaulMcG
  • Many thanks for the quick answer, as well as for your suggestions! – Nathan Jones Sep 27 '21 at 16:21
  • Now I want to iterate over these results in a depth-first traversal. Do you have any ideas or suggestions on how to do that? My end goal is to convert the parse results to a [django-haystack SearchQuerySet filter expression](https://django-haystack.readthedocs.io/en/master/searchqueryset_api.html). – Nathan Jones Sep 27 '21 at 16:40
  • Please look at the example code for SimpleBool.py, to see how to build your own AST by adding AST node classes for the various operator levels (the class is the optional 4th element in each operator tuple). Then you should be able to use these classes to construct your filter expression, by writing a method like `create_filter_string` in place of the `evaluate` methods you'll see in SimpleBool.py. If you have these methods call the `create_filter_string` methods of their contained elements, that will do the recursion for you. – PaulMcG Sep 27 '21 at 17:29
  • Sorry, I'm not seeing any methods named `evaluate` in the [SimpleBool.py example](https://github.com/pyparsing/pyparsing/blob/master/examples/simpleBool.py). Could you please elaborate on what you mean by that? – Nathan Jones Sep 27 '21 at 17:55
  • After reading the example code more closely, I think I can follow it now. Thanks again for your help. – Nathan Jones Sep 27 '21 at 18:03
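
For the depth-first traversal asked about in the comments, here is a minimal sketch of a simpler alternative to the AST-class approach: a plain recursive function over the nested lists that asList() returns. It assumes the operators have already been normalized to "&", "|", and "!", and it uses plain strings as stand-ins for the comparison objects. The name to_filter_string and the sample input are hypothetical.

```python
def to_filter_string(node):
    """Depth-first conversion of one parsed node into a readable string."""
    if not isinstance(node, list):
        return str(node)                    # leaf: a single comparison
    if len(node) == 2 and node[0] == "!":   # unary group: ['!', operand]
        return "NOT " + to_filter_string(node[1])
    # binary group: [operand, op, operand, op, operand, ...]
    words = {"&": "AND", "|": "OR"}
    parts = [to_filter_string(node[0])]
    for op, operand in zip(node[1::2], node[2::2]):
        parts += [words[op], to_filter_string(operand)]
    return "(" + " ".join(parts) + ")"


# Hypothetical input in the shape asList() returns, with strings as leaves.
parsed = [[["field1:value1a", "|", "field1:value1b"], "&", ["!", "field2:value2a"]]]
print(to_filter_string(parsed[0]))
```

The same recursion could build a filter object instead of a string by changing what the leaf and the two group branches return.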