
I am struggling to parse nested structures with PyParsing. I've searched many of the 'nested' example uses of PyParsing, but I don't see how to fix my problem.

Here is what my internal structure looks like:

texture_unit optionalName
{
    texture required_val
    prop_name1 prop_val1
    prop_name2 prop_val1
}

And here is what my external structure looks like. It can contain zero or more of the internal structures.

pass optionalName
{
    prop_name1 prop_val1
    prop_name2 prop_val1

    texture_unit optionalName
    {
        // edit 2: showing use of '.' character in value
        texture required_val.file.name optional_val // edit 1: forgot this line in initial post.

        // edit 2: showing potentially multiple values
        prop_name3 prop_val1 prop_val2
        prop_name4 prop_val1
    }
}

I am successfully parsing the internal structure. Here is my code for that.

prop_ = pp.Group(pp.Word(pp.alphanums+'_')+pp.Group(pp.OneOrMore(pp.Word(pp.alphanums+'_'+'.'))))
texture_props_ = pp.Group(pp.Literal('texture') + pp.Word(pp.alphanums+'_'+'.')) + pp.ZeroOrMore(prop_)
texture_ = pp.Forward()
texture_ << pp.Literal('texture_unit').suppress() + pp.Optional(pp.Word(pp.alphanums+'_')).suppress() + pp.Literal('{').suppress() + texture_props_ + pp.Literal('}').suppress()

Here is my attempt to parse the outer structure:

pass_props_ = pp.ZeroOrMore(prop_)
pass_ = pp.Forward()
pass_ << pp.Literal('pass').suppress() + pp.Optional(pp.Word(pp.alphanums+'_'+'.')).suppress() + pp.Literal('{').suppress() + pass_props_ + pp.ZeroOrMore(texture_) + pp.Literal('}').suppress()

When I say: pass_.parseString( testPassStr )

I see errors in the console that "}" was expected.

I see this as very similar to the C struct example, but I'm not sure what the missing magic is. I'm also curious how to control the resulting data structure when using nestedExpr.

cyrf
  • Here is another example supporting nested structures. It looks like it uses a 'pyparsing.Dict'. All these examples show a different way to achieve the nested parsing, what is the commonality? http://pyparsing.wikispaces.com/share/view/40834661 – cyrf May 28 '14 at 17:45

2 Answers


There are two problems:

  1. In your grammar, the texture literal is required inside a texture_unit block, but your second example has no texture line.
  2. In the second example, pass_props_ also matches texture_unit optionalName. After that, pp.Literal('}') expects } but finds {. This is the reason for the error.

We can check it by changing the pass_ rule like this:

pass_ << pp.Literal('pass').suppress() + pp.Optional(pp.Word(pp.alphanums+'_'+'.')).suppress() + \
             pp.Literal('{').suppress() + pass_props_

print(pass_.parseString(s2))

This gives the following output:

[['prop_name', ['prop_val', 'prop_name', 'prop_val', 'texture_unit', 'optionalName']]]

We can see that pass_props_ also matches texture_unit optionalName.
So what we want is this: prop_ can contain alphanumerics, _ and ., but must not match the texture_unit literal. We can do that with a regex and a negative lookahead:

prop_ = pp.Group(  pp.Regex(r'(?!texture_unit)[a-z0-9_]+')+ pp.Group(pp.OneOrMore(pp.Regex(r'(?!texture_unit)[a-z0-9_.]+'))) )
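
To see what the lookahead does in isolation, here is a small sketch using only the standard re module. The helper name matches_name is made up for illustration; the pattern is the same one passed to pp.Regex above, which applies it at the current parse position in much the same way re.match does here. Note that the lookahead also rejects any name that merely starts with texture_unit, which may or may not be what you want.

```python
import re

# Same pattern as the property-name Regex above.
NAME = re.compile(r'(?!texture_unit)[a-z0-9_]+')

def matches_name(token):
    """Hypothetical helper: True if the pattern matches at the start of token."""
    return NAME.match(token) is not None

print(matches_name('prop_name1'))          # ordinary property name
print(matches_name('texture'))             # 'texture' alone is still allowed
print(matches_name('texture_unit'))        # rejected by the lookahead
print(matches_name('texture_unit_scale'))  # also rejected: it starts with texture_unit
```

If names such as texture_unit_scale must be accepted, one option (an assumption, not something the original answer uses) is to end the lookahead with a word boundary, as in (?!texture_unit\b).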

Finally, a working example looks like this:

import pyparsing as pp

s1 = '''texture_unit optionalName
    {
    texture required_val
    prop_name prop_val
    prop_name prop_val
}'''

prop_ = pp.Group(  pp.Regex(r'(?!texture_unit)[a-z0-9_]+')+ pp.Group(pp.OneOrMore(pp.Regex(r'(?!texture_unit)[a-z0-9_.]+'))) )
texture_props_ = pp.Group(pp.Literal('texture') + pp.Word(pp.alphanums+'_'+'.')) + pp.ZeroOrMore(prop_)
texture_ = pp.Forward()
# Use <<= so the definition is attached to the Forward instead of rebinding the name.
texture_ <<= pp.Literal('texture_unit').suppress() + pp.Word(pp.alphanums+'_').suppress() +\
           pp.Literal('{').suppress() + pp.Optional(texture_props_) + pp.Literal('}').suppress()

print(texture_.parseString(s1))

s2 = '''pass optionalName
{
    prop_name1 prop_val1.name
    texture_unit optionalName1
    {
        texture required_val1
        prop_name2 prop_val12
        prop_name3 prop_val13
    }
    texture_unit optionalName2
    {
        texture required_va2l
        prop_name2 prop_val22
        prop_name3 prop_val23
    }
}'''

pass_props_ = pp.ZeroOrMore(prop_  )
pass_ = pp.Forward()

# Use <<= so the definition is attached to the Forward instead of rebinding the name.
pass_ <<= pp.Literal('pass').suppress() + pp.Optional(pp.Word(pp.alphanums+'_'+'.')).suppress() +\
        pp.Literal('{').suppress() + pass_props_ + pp.ZeroOrMore(texture_ ) + pp.Literal('}').suppress()

print(pass_.parseString(s2))

Output:

[['texture', 'required_val'], ['prop_name', ['prop_val', 'prop_name', 'prop_val']]]
[['prop_name1', ['prop_val1.name']], ['texture', 'required_val1'], ['prop_name2', ['prop_val12', 'prop_name3', 'prop_val13']], ['texture', 'required_va2l'], ['prop_name2', ['prop_val22', 'prop_name3', 'prop_val23']]]
NorthCat
  • 1. You are correct my nested example was missing the required 'texture' property. This was a typo when posting. I will correct it in the post. – cyrf May 27 '14 at 16:33
  • @cyrf What about the second item and the solution for it? – NorthCat May 27 '14 at 16:48
  • about #2, Thanks for the great suggestion. I'm still testing it. I am trying to understand why the 'negative lookahead' was not needed in the C Struct Parser example, which supports nested C structs (linked in my original post). – cyrf May 27 '14 at 18:02
  • The new definition of 'prop_' breaks parsing of the internal structure. I should have made the test more explicit, I was trying to make it general and readable. I will edit my post now to better specify what I need. Can you comment on why the C struct example did not need 'negative lookahead'? – cyrf May 27 '14 at 21:42
  • If the internal structure does not use a '.' in the prop_val, then the internal structure is parsed. However, parsing the outer structure with your changes still yields errors. – cyrf May 27 '14 at 21:55
  • my apologies, your code works. I did have to change the 2nd assignments of texture_ and pass_ to use the << operator (instead of the = operator), but it is working. I'm not sure why it does not work when I pull it into my code. Checking... – cyrf May 27 '14 at 23:53

The answer I was looking for is related to the use of the 'Forward' parser, shown in the Cstruct example (linked in OP).

The hard part of defining a grammar for a nested structure is defining all the possible member types of the structure, which must include the structure itself, which is not yet defined.

The "trick" to defining the pyparsing grammar for a nested structure is to delay the definition of the structure, but include a "forward declared" version of the structure when defining the structure members, so the members can also include a structure. Then complete the structure grammar as a list of members.

struct = Forward()
member = blah | blah2 | struct
struct << ZeroOrMore( Group(member) )
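
As a rough, runnable version of that outline, here is a minimal sketch. The names block, prop, name and value and the sample text are invented for illustration and are not from my real grammar; it uses the <<= form of the Forward assignment.

```python
import pyparsing as pp

LBRACE, RBRACE = map(pp.Suppress, '{}')
name = pp.Word(pp.alphas + '_', pp.alphanums + '_')
value = pp.Word(pp.alphanums + '_.')

block = pp.Forward()                 # declared now, defined further down
prop = pp.Group(name + value)        # e.g. "prop_name prop_val"
member = block | prop                # a member may itself be a block
block <<= pp.Group(name + LBRACE + pp.ZeroOrMore(member) + RBRACE)

text = '''
outer {
    a 1
    inner {
        b 2
    }
}
'''
result = block.parseString(text).asList()
print(result)
```

Because member refers to block before block is defined, the same rule handles any nesting depth.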

This is also discussed in the question "Pyparsing: Parsing semi-JSON nested plaintext data to a list".

The OP (mine) described test data and a grammar that were not specific enough, so the grammar matched when it should have failed. @NorthCat correctly spotted the undesired matches in the grammar. However, the suggestion to define many 'negative lookaheads' seemed unmanageable.

Instead of defining what should not match, my solution explicitly lists the possible matches. The matches are member keywords, defined with oneOf('list of words separated by spaces'). Once I had specified all the possible matches, I realized my structure was not truly nested; it has a finite depth, with a different grammar at each depth. So my member definition did not require the Forward declaration trick.
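
A minimal sketch of that idea follows. The keyword lists are hypothetical (my real property names are not shown in this post), and for brevity each property takes exactly one value. Because texture_unit is not in the pass keyword list, a pass property can never swallow it, and no Forward is needed.

```python
import pyparsing as pp

LBRACE, RBRACE = map(pp.Suppress, '{}')
name = pp.Word(pp.alphanums + '_')
value = pp.Word(pp.alphanums + '_.')

# Hypothetical keyword lists, one per depth.
pass_key = pp.oneOf('lighting depth_write')
texture_key = pp.oneOf('texture filtering')

# Inner level: only texture keywords are accepted as property names.
texture_prop = pp.Group(texture_key + value)
texture_unit = pp.Group(pp.Suppress('texture_unit') + pp.Optional(name).suppress()
                        + LBRACE + pp.ZeroOrMore(texture_prop) + RBRACE)

# Outer level: only pass keywords, followed by zero or more texture units.
pass_prop = pp.Group(pass_key + value)
pass_block = (pp.Suppress('pass') + pp.Optional(name).suppress()
              + LBRACE + pp.ZeroOrMore(pass_prop) + pp.ZeroOrMore(texture_unit) + RBRACE)

text = '''pass main
{
    lighting off
    texture_unit base
    {
        texture wall.png
        filtering linear
    }
}'''
result = pass_block.parseString(text).asList()
print(result)
```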

The terminator of my member definitions is different from the one in the Cstruct example. Instead of terminating with a ';' (semicolon) as in C++, my member definitions terminate at the end of the line. In pyparsing, you can match the end of the line with the 'LineEnd' parser. So I defined my members as a list of values NOT including the 'LineEnd'. Notice the use of the "Not" (~) operator in the last definition:

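from pyparsing import (Combine, Group, LineEnd, OneOrMore, Optional, Word,
                       alphanums, alphas, nums, oneOf)

EOL = LineEnd().suppress()
ident = Word( alphas+"_", alphanums+"_$@#." )
integer = Word(nums)
real = Combine(Optional(oneOf('+ -')) + Word(nums) + '.' + Optional(Word(nums)))
# real must come before integer, or "1.5" would match as the integer 1
propVal = real | integer | ident
# ~EOL stops the value list at the end of the current line
propList = Group(OneOrMore(~EOL + propVal))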
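
Here is a small usage sketch of those definitions. The member and members rules and the sample text are assumptions added for illustration; they are not part of my actual grammar. The point is that each value list stops at its own line end instead of running into the next property name.

```python
from pyparsing import (Combine, Group, LineEnd, OneOrMore, Optional, Word,
                       alphanums, alphas, nums, oneOf)

EOL = LineEnd().suppress()
ident = Word(alphas + "_", alphanums + "_$@#.")
integer = Word(nums)
real = Combine(Optional(oneOf('+ -')) + Word(nums) + '.' + Optional(Word(nums)))
propVal = real | integer | ident
propList = Group(OneOrMore(~EOL + propVal))

# Hypothetical member rule: a property name, its values, then the line end.
member = Group(ident + propList + EOL)
members = OneOrMore(member)

text = "prop_name3 prop_val1 prop_val2\nprop_name4 1.5\n"
result = members.parseString(text).asList()
print(result)
```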
cyrf