I'm writing a parser in Spirit X3 in order to get familiar with it, and even though I'm pretty familiar with Qi, I'm still hitting some stumbling blocks in X3.

For example, the Qi examples include a basic XML parser that shows you how to match a previously matched value using Phoenix placeholders. However, I've only partly been able to figure out how to do the same in X3:

#include <iostream>
#include <string>
#include <boost/spirit/home/x3.hpp>
#include <boost/fusion/include/adapt_struct.hpp>

namespace x3 = boost::spirit::x3;

namespace mytest
{

struct SimpleElement
{
    std::string tag;
    std::string content;
};

} // namespace mytest

BOOST_FUSION_ADAPT_STRUCT
(
    mytest::SimpleElement, tag, content
)

namespace mytest
{

namespace x3 = boost::spirit::x3;
namespace ascii = boost::spirit::x3::ascii;

using x3::lit;
using x3::lexeme;
using ascii::char_;

const x3::rule<class SimpleElementID, SimpleElement> simpleTag = "simpleTag";

auto assignTag = [](auto& ctx)
{
    x3::_val(ctx).tag = x3::_attr(ctx);
};

auto testTag = [](auto& ctx)
{
    x3::_pass(ctx) = 
        (x3::_val(ctx).tag == x3::_attr(ctx));
};

auto assignContent = [](auto& ctx)
{
    x3::_val(ctx).content = x3::_attr(ctx);
};

auto const simpleTag_def
    = '['
    >> x3::lexeme[+(char_ - ']')][assignTag]
    >> ']'
    >> x3::lexeme[
        +(char_ - x3::lit("[/"))]
            [assignContent]
    >> "[/"
    >> x3::lexeme[+(char_ - ']')][testTag]
    >> ']'
    ;

BOOST_SPIRIT_DEFINE(simpleTag);

} // namespace mytest


int main() 
{

const std::string text = "[test]Hello World![/test]";
std::string::const_iterator start = std::begin(text);
const std::string::const_iterator stop = std::end(text);

mytest::SimpleElement element{};

bool result = 
    phrase_parse(start, stop, mytest::simpleTag, x3::ascii::space, element);

if (!result)
{
    std::cout << "failed to parse!\n";
}
else
{
    std::cout << "tag    : " << element.tag << '\n';
    std::cout << "content: " << element.content << '\n';
}

}

(Link: https://wandbox.org/permlink/xLZN9plcOwkSKCrD )

This works, however if I try to parse something like [test]Hello [/World[/test] it doesn't work because I have not specified the correct omission here:

    >> x3::lexeme[
        +(char_ - x3::lit("[/"))]
            [assignContent]

Essentially I want to tell the parser something like:

    >> x3::lexeme[
        +(char_ - (x3::lit("[/")  >> *the start tag* >> ']') )]
            [assignContent]

How could I go about doing this? Also, is the way in which I'm referencing the start tag and later matching it the "best" way to do this in X3 or is there a better/more preferred way?

Thank you!

Addy

2 Answers

Nice question.

The best answer would be to do exactly what XML does: outlaw [/ inside the tag data. In fact, XML outlaws < (because it could be opening a nested tag, and you don't want to have to potentially read-ahead the entire stream to find whether it is a valid subtag).

XML uses character entities ("escapes" like &lt; and &gt;) or unparsed character data (<![CDATA[ ... ]]> sections) to encode content that requires these characters.
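To make the escaping idea concrete for the bracket format, here is a minimal sketch. The scheme is an assumption for illustration only: the names escapeContent and unescapeContent are hypothetical, and the choice of "&#91;" (the XML numeric character reference for '[') and "&amp;" is borrowed from XML rather than taken from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical escaping scheme modelled on XML character references:
// '&' becomes "&amp;" and '[' becomes "&#91;" (91 is the ASCII code of '[').
std::string escapeContent(const std::string& raw)
{
    std::string out;
    for (char c : raw) {
        if (c == '&')      out += "&amp;";
        else if (c == '[') out += "&#91;";
        else               out += c;
    }
    // Escaped content can no longer contain the start of a closing tag.
    assert(out.find('[') == std::string::npos);
    return out;
}

// Reverses escapeContent; any other '&' sequence is copied through unchanged.
std::string unescapeContent(const std::string& escaped)
{
    std::string out;
    for (std::size_t i = 0; i < escaped.size(); ) {
        if (escaped.compare(i, 5, "&amp;") == 0)      { out += '&'; i += 5; }
        else if (escaped.compare(i, 5, "&#91;") == 0) { out += '['; i += 5; }
        else                                          { out += escaped[i++]; }
    }
    return out;
}
```

With content escaped this way, the simple `+(char_ - "[/")` rule from the question is sufficient, because "[/" can only ever appear as the start of a real closing tag.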

Next up, you can of course do a negative lookahead assertion (!closeTag, or the difference form char_ - closeTag) using the tag attribute member like you already did.
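Stripped of Spirit, the lookahead amounts to "consume one character at a time as long as the matching closing tag does not start here". The following is a rough hand-written sketch of that idea, not X3 code; parseSimpleElement is a hypothetical name, and unlike phrase_parse it does no whitespace skipping and requires the whole input to be consumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct SimpleElement
{
    std::string tag;
    std::string content;
};

// Hand-written analogue of: openTag >> +(char_ - closeTag) >> closeTag
bool parseSimpleElement(const std::string& text, SimpleElement& out)
{
    if (text.empty() || text.front() != '[') return false;
    const std::size_t tagEnd = text.find(']');
    if (tagEnd == std::string::npos || tagEnd == 1) return false; // no or empty tag

    const std::string tag = text.substr(1, tagEnd - 1);
    const std::string closeTag = "[/" + tag + "]";

    // Negative lookahead: keep consuming while closeTag does not start at pos.
    std::string content;
    std::size_t pos = tagEnd + 1;
    while (pos < text.size() && text.compare(pos, closeTag.size(), closeTag) != 0)
        content += text[pos++];
    assert(pos <= text.size());

    if (content.empty()) return false;                                      // '+' needs one char
    if (text.compare(pos, closeTag.size(), closeTag) != 0) return false;    // ran off the end
    if (pos + closeTag.size() != text.size()) return false;                 // trailing input

    out.tag = tag;
    out.content = content;
    return true;
}
```

A stray "[/World" inside the content is consumed like any other text because it is not the start of "[/test]".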

Reshuffling the rule spelling a little, it's not even that bad:

Note that I removed the need for manual propagation of the tag/contents by passing true as the third template argument (force_attribute) of the simpleTag rule. See Boost Spirit: "Semantic actions are evil"?

const x3::rule<class SimpleElementID, SimpleElement, true> simpleTag = "simpleTag";
auto testTag = [](auto& ctx) { _pass(ctx) = (_val(ctx).tag == _attr(ctx)); };

auto openTag     = '[' >> x3::lexeme[+(char_ - ']')] >> ']';
auto closeTag    = "[/" >> x3::lexeme[+(char_ - ']')] [testTag] >> ']';
auto tagContents = x3::lexeme[ +(char_ - closeTag) ];

auto const simpleTag_def
    =  openTag
    >> tagContents
    >> x3::omit [ closeTag ]
    ;

See it Live On Coliru

Background

That works but ends up getting quite clumsy, because it means using semantic actions all around and it also goes against the natural binding of attribute references.

Thinking outside the box a little:

In Qi you'd use qi::locals or inherited attributes for this (see a very similar example in the docs: MiniXML).

Both of these would have the net effect of extending the parser context with your piece(s) of information.

X3 has no such "high-level" features. But it does have the building block to extend your context: x3::with<>(data) [ p ].

x3::with

In this simple example it would seem overkill, but at some point you will appreciate how you use extra context in your rules without holding your attribute types hostage:

struct TagName{};
auto openTag
    = x3::rule<struct openTagID, std::string, true> {"openTag"}
    = ('[' >> x3::lexeme[+(char_ - ']')] >> ']')
        [([](auto& ctx) { x3::get<TagName>(ctx) = _attr(ctx); })]
    ;
auto closeTag
    = x3::rule<struct closeTagID, std::string, true> {"closeTag"}
    = ("[/" >> x3::lexeme[+(char_ - ']')] >> ']')
        [([](auto& ctx) { _pass(ctx) = (x3::get<TagName>(ctx) == _attr(ctx)); })]
    ;
auto tagContents
    = x3::rule<struct tagContentsID, std::string> {"tagContents"}
    = x3::lexeme[ +(char_ - closeTag) ];

auto const simpleTag
    = x3::rule<class SimpleElementID, SimpleElement, true> {"simpleTag"}
    = x3::with<TagName>(std::string()) [
        openTag
        >> tagContents
        >> x3::omit [ closeTag ]
    ];

See it Live On Coliru


sehe
  • Extremely useful and perfectly explained. Thank you @sehe! – Addy Jun 02 '20 at 20:58
  • Just realized that regex is probably a nicer tool for this. Here's [using Boost Xpressive](https://godbolt.org/z/6bW-4f), or this [minimized version](https://godbolt.org/z/STrsQj). See also here for speed indications. I think the non-greedy submatch `(s2 = -*_)` is factually more elegant than any of the X3 approaches. (Of course if the larger context needs a parser, X3 is your tool, see also https://stackoverflow.com/questions/49262634/slow-performance-using-boost-xpressive/49312362#49312362 for random perf trivia) – sehe Jun 02 '20 at 23:21
  • Oh sure, for something as simple as what I presented regex would be the way to go. This, though, is just a small part of a larger parser I'm writing. So expect more questions. ;) – Addy Jun 02 '20 at 23:49
  • In your second block of code where you define `auto tagContents` you also have `=x3::rule – Addy Jun 03 '20 at 00:10
  • It's showing that it works if you define the subrules to be strongly typed to std::string attributes, instead of "hovering" under the main rule and enjoying (accidental) access to the entire `SimpleElement` attribute. That approach doesn't scale well, and being able to help attribute compatibility by splitting up typed rules is enormously helpful when building more complicated grammars. (Incidentally, it also highlights that you don't need the DEFINE/DECLARE/INSTANTIATE macros unless you spread rules across TUs and/or use them recursively). – sehe Jun 03 '20 at 00:33
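
The regex idea from the comments can also be sketched with the standard library. Note that this swaps in std::regex for Boost Xpressive, so it is not the code behind the linked versions; matchSimpleElement is a hypothetical name, and the pattern assumes single-line input because '.' does not match newlines in the default ECMAScript grammar.

```cpp
#include <cassert>
#include <regex>
#include <string>

struct SimpleElement
{
    std::string tag;
    std::string content;
};

// Group 1 captures the tag name, group 2 is the shortest content (non-greedy),
// and \1 is a back-reference that forces the closing tag to repeat group 1.
bool matchSimpleElement(const std::string& text, SimpleElement& out)
{
    static const std::regex pattern(R"(\[([^\]]+)\](.+?)\[/\1\])");
    std::smatch m;
    if (!std::regex_match(text, m, pattern))
        return false;
    assert(m.size() == 3); // whole match plus two capture groups
    out.tag = m[1].str();
    out.content = m[2].str();
    return true;
}
```

Because regex_match has to consume the whole input, the non-greedy group grows until the back-referenced closing tag lines up with the end of the string.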
Instead of trying to build a ship with string and matches I would suggest to make a tool that is suitable for the work.

#include <boost/spirit/home/x3.hpp>

namespace x3e
{

struct confix_tag {};

namespace x3 = boost::spirit::x3;

template <typename Parser, typename Iterator,
    typename Context, typename RContext>
inline Iterator seek(Parser const& p, Iterator& iter, Iterator const& last,
    Context const& context, RContext& rcontext)
{
    Iterator start = iter;
    // Stop at the end of input instead of incrementing past `last`.
    for (; start != last; iter = ++start)
        if (p.parse(iter, last, context, rcontext, x3::unused))
            return start;
    return last;
}


template <typename Prefix, typename Subject, typename Postfix>
struct confix_directive : x3::unary_parser<Subject, confix_directive<Prefix, Subject, Postfix>>
{
    typedef x3::unary_parser<Subject, confix_directive<Prefix, Subject, Postfix>> base_type;
    static bool const is_pass_through_unary = true;

    constexpr confix_directive(Prefix const& prefix, Subject const& subject, Postfix const& postfix)
        : base_type(subject),
          prefix(prefix),
          postfix(postfix)
    {
    }

    template <typename Iterator,
        typename Context, typename RContext, typename Attribute>
    bool parse(Iterator& first, Iterator const& last,
        Context const& context, RContext& rcontext, Attribute& attr) const
    {
        auto& confix_val = boost::fusion::at_c<0>(attr);

        Iterator iter = first;
        if (!prefix.parse(iter, last, context, rcontext, confix_val))
            return false;

        Iterator postfix_iter = iter;
        do {
            Iterator postfix_start = x3e::seek(postfix, postfix_iter, last, x3::make_context<confix_tag>(confix_val, context), rcontext);
            if (postfix_start == last)
                return false;

            if (this->subject.parse(iter, postfix_start, context, rcontext, boost::fusion::at_c<1>(attr))) {
                first = postfix_iter;
                return true;
            }
        } while (postfix_iter != last);

        return false;
    }

    Prefix prefix;
    Postfix postfix;
};

template<typename Prefix, typename Postfix>
struct confix_gen
{
    template<typename Subject>
    constexpr confix_directive<
        Prefix, typename x3::extension::as_parser<Subject>::value_type, Postfix>
    operator[](Subject const& subject) const
    {
        return { prefix, as_parser(subject), postfix };
    }

    Prefix prefix;
    Postfix postfix;
};


template <typename Prefix, typename Postfix>
constexpr confix_gen<typename x3::extension::as_parser<Prefix>::value_type,
    typename x3::extension::as_parser<Postfix>::value_type>
confix(Prefix const& prefix, Postfix const& postfix)
{
    return { as_parser(prefix), as_parser(postfix) };
}

struct confix_value_matcher : x3::parser<confix_value_matcher>
{
    typedef x3::unused_type attribute_type;
    static bool const has_attribute = false;

    template <typename Iterator, typename Context, typename RContext>
    static bool parse(Iterator& iter, Iterator const& last,
        Context const& context, RContext&, x3::unused_type)
    {
        x3::skip_over(iter, last, context);
        for (auto const& e : x3::get<confix_tag>(context))
            if (iter == last || e != *iter++)
                return false;
        return true;
    }
};

constexpr confix_value_matcher confix_value{};
}

#include <boost/fusion/include/adapt_struct.hpp>

namespace mytest
{

struct SimpleElement
{
    std::string tag;
    std::string content;
};

} // namespace mytest

BOOST_FUSION_ADAPT_STRUCT
(
    mytest::SimpleElement, tag, content
)

#include <cstring>   // std::strlen
#include <iostream>
#include <string>

int main()
{
    namespace x3 = boost::spirit::x3;

    for (auto text : { "[test]Hello World![/test]",
                       "[test]Hello [/World[/test]" }) {
        std::cout << "text   : " << text << '\n';
        auto start = text, stop = text + std::strlen(text);

        mytest::SimpleElement element;

        auto const simpleTag
            = x3e::confix(x3::lexeme['[' >> +~x3::char_(']') >> ']'],
                          x3::lexeme["[/" >> x3e::confix_value >> ']'])
                              [x3::lexeme[*x3::char_]];

        bool result =
            phrase_parse(start, stop, simpleTag, x3::ascii::space, element);

        if (!result) {
            std::cout << "failed to parse!\n";
        }
        else {
            std::cout << "tag    : " << element.tag << '\n';
            std::cout << "content: " << element.content << '\n';
        }
        std::cout << '\n';
    }
}

Output:

text   : [test]Hello World![/test]
tag    : test
content: Hello World!

text   : [test]Hello [/World[/test]
tag    : test
content: Hello [/World

https://wandbox.org/permlink/qxIaQYtgaWdk9Dog
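
For readers who want the control flow of the directive without the X3 machinery, here is a rough standard-library sketch of the same seek-and-retry idea. It is only an analogue under stated assumptions: parseConfix and acceptContent are hypothetical names, the prefix and postfix are hard-wired to the [tag] / [/tag] shapes, and the subject parser is reduced to a predicate on the delimited range.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

struct SimpleElement
{
    std::string tag;
    std::string content;
};

// Seek each candidate postfix "[/tag]" in turn; the first candidate whose
// preceding range is accepted by the subject predicate wins.
bool parseConfix(const std::string& text,
                 const std::function<bool(const std::string&)>& acceptContent,
                 SimpleElement& out)
{
    if (text.empty() || text.front() != '[') return false;
    const std::size_t tagEnd = text.find(']');
    if (tagEnd == std::string::npos || tagEnd == 1) return false; // no or empty tag

    const std::string tag = text.substr(1, tagEnd - 1);
    const std::string postfix = "[/" + tag + "]";
    const std::size_t contentStart = tagEnd + 1;

    for (std::size_t found = text.find(postfix, contentStart);
         found != std::string::npos;
         found = text.find(postfix, found + postfix.size())) // resume after this postfix
    {
        assert(found >= contentStart);
        const std::string content = text.substr(contentStart, found - contentStart);
        if (acceptContent(content)) {
            out.tag = tag;
            out.content = content;
            return true;
        }
    }
    return false; // no postfix delimits content that the subject accepts
}
```

If the predicate rejects the range before the first closing tag, the search resumes after that tag, which is the retry behaviour of the do/while loop in confix_directive::parse.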

sehe
Nikita Kniazev
  • But I wanted `[test]Hello [/World[/test]` to parse. – Addy Jun 03 '20 at 19:21
  • I have updated the answer with solution, but really you are asking for some weird stuff that leads to catastrophic backtracking (notable examples of it: https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016 https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/ ) – Nikita Kniazev Jun 03 '20 at 23:16
  • I don't see how I'm asking for anything weird. Wanting to match a previous matched portion of some text seems perfectly reasonable. – Addy Jun 04 '20 at 01:14
  • Because of the quadratic complexity of such a parser. If I answered your question, please mark it as an answer, and also update the question to say that you want `[test]Hello [/World[/test]` to match. – Nikita Kniazev Jun 04 '20 at 12:01
  • I'm not an expert on measuring complexity but it still seems like the two variations of the parser (one that allows `[/` and one that doesn't) would be of the same algorithmic complexity. I guess it's just a matter of terminology but I wouldn't consider it "weird". – Addy Jun 04 '20 at 16:11
  • If you have additional questions -- file them separately instead of using comments. – Nikita Kniazev Jun 04 '20 at 18:59
  • I like this approach. @NikitaKniazev isn't x3::seek already a thing? I'd use it to [simplify](https://wandbox.org/permlink/Xb0PVeqEDnKawpCQ) (_also adds some edge case tests by having a subject parser that rejects some matches_). I learned "unary_parser". What is `is_pass_through_unary` used for in this context? – sehe Jun 14 '20 at 22:26
  • (From my own understanding it looks as if it shortcuts some attribute-type computations) – sehe Jun 14 '20 at 22:29
  • I just did not come to combining `seek` with `raw`, well done! :-) However, I am worried about creating a temporary parser from user ones, could be costly. The `is_pass_through_unary = true` thing also allows sequence parser to partition through it, and `parse_into_container` should have been passing through it, but it is not =\ – Nikita Kniazev Jun 15 '20 at 00:49