I'm trying to solve the following problem using Pyparsing: I want to search a string for occurrences of three types of elements:

  1. lowercase words
  2. lowercase words following the literal string "OBJ"
  3. lists of these elements, separated by ','

An example string could be "foo bar OBJ baz foo,bar".

I want to process each of these elements in its own parse action.

Here is my code:

import pyparsing
from pyparsing import Word, Literal, alphas

def found_word(s, l, t):
    print('word')
def found_obj(s, l, t):
    print('obj')
def found_list(s, l, t):
    print('list')

def process(string):

    word = ~Literal('OBJ ') + Word(alphas.lower())
    word.setParseAction(lambda s,l,t: found_word(s, l, t))
    obj = Literal('OBJ ') +  Word(alphas.lower())
    obj.setParseAction(lambda s,l,t: found_obj(s, l, t))
    item = word | obj
    list = pyparsing.delimitedList(item, delim=',')
    list.setParseAction(lambda s,l,t: found_list(s, l, t))
    element = word | obj | list

    parser = pyparsing.OneOrMore(element)
    parser.searchString(string).pprint()

if __name__ == "__main__":
    process('foo bar OBJ baz foo,bar')

Edit: I've put some test output inside the parseActions just to see if they are getting called. The desired output would be:

word
word
obj
word
word
list

The actual output is:

word
word
obj
word
word

I.e. the parseAction for the list is not called. How do I need to change my code in order to achieve this?

Update: The delimitedList isn't working as I expected. When I call

pyparsing.OneOrMore(list).searchString('foo,bar baz')

found_list seems to be called twice, although there is only one list element in my string:

word
word
list
word
list
drkncls

2 Answers

0

Try this:

s = 'foo bar OBJ baz foo,bar'


for w in s.split(' '):
  if w.islower():
    print("word")
  if 'OBJ' in w:
    print("obj")
  if ',' in w:
    print('list')
  • I used `split()` to turn the string into a list, then a group of conditions to test each item of the list. For me, this gives the desired output! – Michel Guimarães Apr 19 '20 at 15:00
  • Thanks for the answer! This way I don't get the elements grouped together like specified in the question. Also, separating the elements with split(' ') doesn't work if there is 'OBJ foo' inside the list. – drkncls Apr 19 '20 at 15:19
0

The reason your list is not being parsed lies in this expression:

element = word | obj | list

Because you check for word before list (which is a really awful variable name when working in Python, btw), the leading "foo" in "foo,bar" is processed as a word: '|' is an eager operator that stops at the first alternative that matches.

You can fix this by changing the order of expressions in element:

element = list | word | obj

Or by using '^' instead of '|'. '^' is a patient operator - it evaluates all of the alternative expressions and selects the longest match.

element = word ^ obj ^ list
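
The difference between the two operators can be checked in isolation. This is only a minimal sketch, not the grammar from the question: `wd` and `wd_list` are illustrative names, and `parseString` is called without `parseAll`, so a partial match is accepted.

```python
from pyparsing import Group, Word, alphas, delimitedList

# Illustrative expressions (names are not from the question).
wd = Word(alphas.lower())
wd_list = Group(delimitedList(wd))

# '|' builds a MatchFirst: alternatives are tried left to right and the
# first one that matches wins, even if a later one would match more text.
first_result = (wd | wd_list).parseString("foo,bar").asList()

# '^' builds an Or: every alternative is tried and the longest match wins.
longest_result = (wd ^ wd_list).parseString("foo,bar").asList()

print(first_result)    # ['foo']
print(longest_result)  # [['foo', 'bar']]
```

With '|', the order of the alternatives decides the result; with '^', it does not, at the cost of trying every alternative.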

With either of these changes, your output now becomes:

word
list
word
list
obj
word
word
list

Why all the list matching? Because delimitedList will match a single item:

>>> wd = Word(alphas)
>>> wdlist = delimitedList(wd)
>>> print(wdlist.parseString('xyz'))
['xyz']

If you want to enforce that lists must have > 1 item, then you can add a condition parse action:

>>> wdlist.addCondition(lambda t: len(t)>1)
>>> print(wdlist.parseString('xyz')) 
... raises exception ...
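
As a runnable sketch of the same idea, the condition can be wrapped in a small helper that catches the exception. The helper name `try_parse_list` is hypothetical and only used for illustration here.

```python
from pyparsing import ParseException, Word, alphas, delimitedList

wd = Word(alphas)
# The condition rejects any match that yields fewer than two tokens.
wdlist = delimitedList(wd).addCondition(lambda t: len(t) > 1)

def try_parse_list(text):
    """Return the parsed tokens, or None if text is not a list of 2+ items."""
    try:
        return wdlist.parseString(text).asList()
    except ParseException:
        # A failed condition surfaces as an ordinary ParseException.
        return None

print(try_parse_list("abc,def"))  # ['abc', 'def']
print(try_parse_list("xyz"))      # None
```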

Also, delimitedLists do not automatically group their results:

>>> print((wd + wdlist).parseString('xyz abc,def'))
['xyz', 'abc', 'def']

If you want to keep the list contents as a list in the results, then wrap the list expression in a Group:

>>> print((wd + Group(wdlist)).parseString('xyz abc,def'))
['xyz', ['abc', 'def']]

Here is my updated version of your process() method:

from pyparsing import Group  # needed in addition to the question's imports

def process(string):
    print(string)

    word = ~Literal('OBJ') + Word(alphas.lower())
    word.addParseAction(lambda s,l,t: found_word(s, l, t))
    word.setName("word")
    obj = Literal('OBJ') +  Word(alphas.lower())
    obj.setName("obj")
    obj.addParseAction(lambda s,l,t: found_obj(s, l, t))
    item = word | obj
    list = Group(pyparsing.delimitedList(item, delim=',')
                    .addCondition(lambda t: len(t)>1))
    list.setName("list")
    list.addParseAction(lambda s,l,t: found_list(s, l, t))
    element = obj | list | word

    parser = pyparsing.OneOrMore(element)
    parser.searchString(string).pprint()

Which gives this output:

foo bar OBJ baz foo,bar
word
word
word
word
obj
word
word
list
[['foo', 'bar', 'OBJ', 'baz', ['foo', 'bar']]]

You'll note that I added setName() calls for each of your expressions. That is so that I could add setDebug() to get pyparsing's debug output. By adding:

word.setDebug()
obj.setDebug()
list.setDebug()

before calling searchString, you get this debugging output. It may help explain why you are getting the replicated "word"s in your sample output.

foo bar OBJ baz foo,bar
Match obj at loc 0(1,1)
Exception raised:Expected "OBJ", found 'f'  (at char 0), (line:1, col:1)
Match list at loc 0(1,1)
Match word at loc 0(1,1)
word
Matched word -> ['foo']
Exception raised:failed user-defined condition, found 'f'  (at char 0), (line:1, col:1)
Match word at loc 0(1,1)
word
Matched word -> ['foo']
Match obj at loc 3(1,4)
Exception raised:Expected "OBJ", found 'b'  (at char 4), (line:1, col:5)
Match list at loc 3(1,4)
Match word at loc 4(1,5)
word
Matched word -> ['bar']
Exception raised:failed user-defined condition, found 'b'  (at char 4), (line:1, col:5)
Match word at loc 3(1,4)
word
Matched word -> ['bar']
Match obj at loc 7(1,8)
obj
Matched obj -> ['OBJ', 'baz']
Match obj at loc 15(1,16)
Exception raised:Expected "OBJ", found 'f'  (at char 16), (line:1, col:17)
Match list at loc 15(1,16)
Match word at loc 16(1,17)
word
Matched word -> ['foo']
Match word at loc 20(1,21)
word
Matched word -> ['bar']
list
Matched list -> [['foo', 'bar']]
Match obj at loc 23(1,24)
Exception raised:Expected "OBJ", found end of text  (at char 23), (line:1, col:24)
Match list at loc 23(1,24)
Match word at loc 23(1,24)
Exception raised:Expected W:(abcd...), found end of text  (at char 23), (line:1, col:24)
Match obj at loc 23(1,24)
Exception raised:Expected "OBJ", found end of text  (at char 23), (line:1, col:24)
Exception raised:Expected {word | obj}, found end of text  (at char 23), (line:1, col:24)
Match word at loc 23(1,24)
Exception raised:Expected W:(abcd...), found end of text  (at char 23), (line:1, col:24)
Match obj at loc 23(1,24)
Exception raised:Expected "OBJ", found end of text  (at char 23), (line:1, col:24)
Match list at loc 23(1,24)
Match word at loc 23(1,24)
Exception raised:Expected W:(abcd...), found end of text  (at char 23), (line:1, col:24)
Match obj at loc 23(1,24)
Exception raised:Expected "OBJ", found end of text  (at char 23), (line:1, col:24)
Exception raised:Expected {word | obj}, found end of text  (at char 23), (line:1, col:24)
Match word at loc 23(1,24)
Exception raised:Expected W:(abcd...), found end of text  (at char 23), (line:1, col:24)
[['foo', 'bar', 'OBJ', 'baz', ['foo', 'bar']]]
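
The replicated "word"s in the trace above can be reproduced in a much smaller grammar. The sketch below is a reduced illustration rather than the full parser from the question: `calls` and `word_list` are illustrative names, and packrat parsing is assumed to be left at its default (disabled).

```python
from pyparsing import Group, Word, alphas, delimitedList

calls = []  # records every invocation of the word parse action

word = Word(alphas.lower())
word.addParseAction(lambda t: calls.append(t[0]))

# A list needs at least two items; the condition is checked only after the
# items (and their parse actions) have already been processed.
word_list = Group(delimitedList(word).addCondition(lambda t: len(t) > 1))

element = word_list | word
result = element.parseString("foo").asList()

print(result)  # ['foo']
print(calls)   # ['foo', 'foo']: once in the failed list attempt, once as a word
```

The parse action on `word` runs as soon as `word` matches, even though the enclosing list attempt is rejected afterwards, so parse actions with side effects such as `print` can fire more often than the final result suggests.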
PaulMcG
  • delimitedList now takes min and max arguments also, so a list of 2 or more would be done by passing min=2 in the constructor. – PaulMcG Apr 28 '23 at 21:28