Boost.Spirit, how to extend xml parsing?

Question

I would like to extend the XML parsing example from Boost.Spirit by adding support for parsing XML attributes.

Here is the example from the library, with some modifications of my own:

template <typename Iterator>
struct mini_xml_grammar
: qi::grammar<Iterator, mini_xml(), qi::locals<std::string>, ascii::space_type>
{
    mini_xml_grammar()
    : mini_xml_grammar::base_type(xml, "xml")
    {
        using qi::lit;
        using qi::lexeme;
        using qi::attr;
        using qi::on_error;
        using qi::fail;
        using ascii::char_;
        using ascii::string;
        using ascii::alnum;
        using ascii::space;

        using namespace qi::labels;

        using phoenix::construct;
        using phoenix::val;


        text %= lexeme[+(char_ - '<')];
        node %= xml | text;


        start_tag %=
        '<'
        >>  !lit('/')
        >   lexeme[+(char_ - '>')]
        >   '>'
        ;

        end_tag =
        "</"
        >   string(_r1)
        >   '>'
        ;

        xml %=
        start_tag[_a = _1]
        >   *node
        >   end_tag(_a)
        ;

        xml.name("xml");
        node.name("node");
        text.name("text");
        start_tag.name("start_tag");
        end_tag.name("end_tag");

        on_error<fail>
        (
         xml
         , std::cout
         << val("Error! Expecting ")
         << _4                               // what failed?
         << val(" here: \"")
         << construct<std::string>(_3, _2)   // iterators to error-pos, end
         << val("\"")
         << std::endl
         );
    }

    qi::rule<Iterator, mini_xml(), qi::locals<std::string>, ascii::space_type> xml;
    qi::rule<Iterator, mini_xml_node(), ascii::space_type> node;
    qi::rule<Iterator, std::string(), ascii::space_type> text;
    qi::rule<Iterator, std::string(), ascii::space_type> attribute;
    qi::rule<Iterator, std::string(), ascii::space_type> start_tag;
    qi::rule<Iterator, void(std::string), ascii::space_type> end_tag;
};

I tried the following, but it fails to compile with the error "use of undeclared identifier 'eps'":

        xml %= 
        start_tag[_a = _1] 
        > attribute 
        > (  "/>" > eps
            |  ">" > *node > end_tag(_a) 
            )
        ;

Does anyone know how to do this? How can I add the ability to parse XML attributes?

There are a *lot* of [options for XML parsing in C++.](http://stackoverflow.com/questions/9387610/what-xml-parser-should-i-use-in-c) Do you really need to hack together a Boost.Spirit parser for it? — Nicol Bolas, Feb 27 '12 at 23:37
Yes, I know; I have worked with other libraries. I would like to learn Boost.Spirit to understand how it works. RapidXML seems like a good library, but it does not seem to have complete support for XML parsing. — ruslan.berliner, Feb 28 '12 at 11:08
@NicolBolas Thanks for your post about XML libraries, great investigation about it. — ruslan.berliner, Feb 28 '12 at 11:22
Note that your `eps` rule doesn't actually accomplish anything in the grammar. It appears at the *end* of a rule. Rules are already implicitly followed by "nothing." Use `eps` when you want to attach Spirit attribute-handling to a portion of the grammar that doesn't otherwise match anything, or as a placeholder in a list of alternatives. — Rob Kennedy, Feb 28 '12 at 16:18

Rob Kennedy · Accepted Answer · 2012-02-28T16:15:44.577

The eps identifier, like many of the other identifiers you use, is defined in the qi namespace. The others are brought into scope by the using declarations at the top of your constructor. Do the same for eps:

using qi::eps;

Once you resolve that, you have the larger issue of whether you're correctly representing the syntax and grammar of XML. It doesn't look like you have it right. You have this:

xml %= 
      start_tag[_a = _1]
    > attribute
    > (   "/>" > eps
        | ">" > *node > end_tag(_a)
      )
    ;

That can't be right, though. Attributes are part of a tag, not things that follow a tag. It looks like you wanted to break start_tag apart so you could handle empty tags. If I were doing this, I'd probably create an empty_tag rule instead, and then change xml to be empty_tag | (start_tag > *node > end_tag). That's how the W3C language recommendation does it:

[39]  element   ::= EmptyElemTag
                    | STag content ETag

But don't worry about that for now. Remember that your stated task is to add attributes to the parser. Don't get distracted by other missing features. There are plenty of those to work on later.

I mentioned the W3C document. You should refer to that often; it defines the language, and it even shows the grammar. One of the design goals of Spirit was that it should look like a grammar definition. Use that to your advantage by trying to mimic the W3C grammar in your own code. The W3C defines the start tag like this:

[40]  STag      ::= '<' Name (S Attribute)* S? '>'
[41]  Attribute ::= Name Eq AttValue

So write your code like this:

start_tag %=
    // Can't use operator> for "expect" because empty_tag
    // will be the same up to the final line.
       '<'
    >> !lit('/')
    >> name
    >> *attribute
    >> '>'
    ;

name %= ...; // see below

attribute %=
      name
    > '='
    > attribute_value
    ;

The spec defines the attribute-value syntax:

[10]  AttValue  ::= '"' ([^<&"] | Reference)* '"'
                    |  "'" ([^<&'] | Reference)* "'"

I wouldn't worry about entity references yet. Like empty tags, your current code already doesn't support them, so it's not important to add them now as part of attributes. That makes attribute_value easy to define:

attribute_value %=
    // lexeme[] keeps the space skipper from dropping whitespace inside the value
    lexeme[
          '"'  > *(char_ - char_("<&\"")) > '"'
        | '\'' > *(char_ - char_("<&'"))  > '\''
    ]
    ;

The name definition doesn't have to be anything fancy yet. It's complicated in the specification because it handles the full Unicode range of characters, but you can start with something simpler and come back to it later, when you figure out how to handle Unicode characters throughout your parser.

name %=
    lexeme[char_("a-zA-Z:_") >> *char_("-a-zA-Z0-9:_")]
    ;

These changes should allow you to parse XML attributes. However, it's another matter to extract the results as Spirit attributes (so you can know the names and values of attributes for a given tag in the rest of your program), and I'm not prepared to discuss that right now.

Many thanks, Rob! Great answer! Yes, you are right, I'm trying to handle empty tags too. Now I understand a little more about Boost.Spirit. It seems it is not so hard to write a grammar. — ruslan.berliner, Feb 28 '12 at 18:29
