1

I am playing with parsing a Newick tree format using Boost Spirit x3, and I fail at parsing a complete tree.

Minimal reproducible example

This is my attempted solution:

namespace quetzal::newick::parser
{
  namespace x3 = boost::spirit::x3;

  using x3::alpha;
  using x3::alnum;
  using x3::double_;
  using x3::rule;
  using x3::lit;

  rule<struct branch> branch{"branch"};

  auto name    = alpha >> *alnum; // to be improved later
  auto length  = ':' >> double_;
  auto leaf    = -name;
  auto internal= '(' >> (branch % ',') >> ')' >> -name;
  auto subtree = leaf | internal;
  auto tree    = subtree >> ';';

  auto const branch_def = subtree >> -length;

  BOOST_SPIRIT_DEFINE(branch);
}

The tests for parsing the internal grammar seems to work

BOOST_AUTO_TEST_CASE(internal_grammar)
{
  std::vector<std::string> inputs =
  {
    "(,)",
    "(A,B)F",
    "(A:10,B:10)F"
  };

  for(const auto& input : inputs)
  {
    auto iter = input.begin();
    auto iter_end = input.end();
    bool r = phrase_parse(iter, iter_end, quetzal::newick::parser::internal, x3::space );
    BOOST_CHECK(r && iter == iter_end);
  }
}

But the full parser tree fails to parse all but the first tree, and I don't understand why:

BOOST_AUTO_TEST_CASE(full_grammar)
{
  std::vector<std::string> inputs =
  {
    ";",
    "(,);",
    "(,,(,));",
    "(A,B,(C,D));",
    "(A,B,(C,D)E)F;",
    "(:0.1,:0.2,(:0.3,:0.4):0.5);",
    "(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;",
    "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);",
    "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;",
    "((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;"
  };

  for(const auto& input : inputs)
  {
    auto iter = input.begin();
    auto iter_end = input.end();
    bool r = phrase_parse(iter, iter_end, quetzal::newick::parser::tree, x3::space );
    BOOST_CHECK(r && iter == iter_end);
  }
}

Possible shortcomings

  • I was thinking it may be because of my misuse/nonuse of x3::lit, but this question seems to clear it
  • The parser possibly fails are detecting the recursion, but I don't know what in my grammar definition would be faulty for this. I am aware that we should use auto only for non-recursive rules (from cppcon presentation by Michael Caisse but I was hoping I made a proper use of x3::rule here for the recursive rule.
  • Last possible caveat: maybe that the grammar fails to detect empty nodes. I am aware of null parser but it's unclear to me if I should use it here (optional and a possibly empty list should do the trick, right?).
WaterFox
  • 850
  • 6
  • 18
  • 1
    I like how you are learning this very much the right way. Kudos for describing your thought process as well. It saves me from explaining things that you already know and allows me to target my answer to your current (and desired) level of understanding. – sehe Nov 15 '22 at 22:11
  • Awww thank you @sehe it means the world to me <3 Happy to not completely waste your time! :D – WaterFox Nov 15 '22 at 22:21

1 Answers1

2

I made it self-contained: Live On Coliru

Now, when you want to understand the X3 grammar - beyond mentally debugging - you can enable

#define BOOST_SPIRIT_X3_DEBUG

This debugs rules. Consider adding some debug-only rules for more detailed info:

auto dbg(auto name, auto p) { return x3::rule<struct _>{name} = p; };

auto name     = dbg("name",     x3::alpha >> *x3::alnum); // to be improved later
auto length   = dbg("length",   ':' >> x3::double_);
auto leaf     = dbg("leaf",     -name);
auto internal = dbg("internal", '(' >> (branch % ',') >> ')' >> -name);
auto subtree  = dbg("subtree",  leaf | internal);
auto tree     = dbg("tree",     subtree >> ';');

Now the output will be e.g.: Live

<tree>
  <try>;</try>
  <subtree>
    <try>;</try>
    <leaf>
      <try>;</try>
      <name>
        <try>;</try>
        <fail/>
      </name>
      <success>;</success>
    </leaf>
    <success>;</success>
  </subtree>
  <success></success>
</tree>
";" -> true true

You can "trace" the rule invocations and results. Now, let's look at the first fail:

<tree>
  <try>(,);</try>
  <subtree>
    <try>(,);</try>
    <leaf>
      <try>(,);</try>
      <name>
        <try>(,);</try>
        <fail/>
      </name>
      <success>(,);</success>
    </leaf>
    <success>(,);</success>
  </subtree>
  <fail/>
</tree>
"(,);" -> false false

You can see it tries subtree, which tries leaf, which succeeds because leaf is optional by definition:

 auto leaf = -name;

A parser shaped -p will always succeed. So in a|b when a = -p, the alternative b will never be invoked. Either make name less optional, or reorder your branches, so an internal will get a chance before deciding an empty leaf was matched:

auto subtree  = internal | leaf;

Now we get:

void quetzal::newick::test::tree()
";" -> true true
"(,);" -> true true
"(,,(,));" -> true true
"(A,B,(C,D));" -> true true
"(A,B,(C,D)E)F;" -> true true
"(:0.1,:0.2,(:0.3,:0.4):0.5);" -> true true
"(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;" -> false false
"(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);" -> true true
"(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;" -> true true
"((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;" -> true true

Looking at the one remaining failing parse:

<tree>
  <try>(:0.1,:0.2,(:0.3,:0.</try>
  <subtree>
    <try>(:0.1,:0.2,(:0.3,:0.</try>
    <internal>
      <try>(:0.1,:0.2,(:0.3,:0.</try>
      <branch>
        <try>:0.1,:0.2,(:0.3,:0.4</try>
        <subtree>
          <try>:0.1,:0.2,(:0.3,:0.4</try>
          <internal>
            <try>:0.1,:0.2,(:0.3,:0.4</try>
            <fail/>
          </internal>
          <leaf>
            <try>:0.1,:0.2,(:0.3,:0.4</try>
            <name>
              <try>:0.1,:0.2,(:0.3,:0.4</try>
              <fail/>
            </name>
            <success>:0.1,:0.2,(:0.3,:0.4</success>
          </leaf>
          <success>:0.1,:0.2,(:0.3,:0.4</success>
        </subtree>
        <length>
          <try>:0.1,:0.2,(:0.3,:0.4</try>
          <success>,:0.2,(:0.3,:0.4):0.</success>
        </length>
        <success>,:0.2,(:0.3,:0.4):0.</success>
      </branch>
      <branch>
        <try>:0.2,(:0.3,:0.4):0.5</try>
        <subtree>
          <try>:0.2,(:0.3,:0.4):0.5</try>
          <internal>
            <try>:0.2,(:0.3,:0.4):0.5</try>
            <fail/>
          </internal>
          <leaf>
            <try>:0.2,(:0.3,:0.4):0.5</try>
            <name>
              <try>:0.2,(:0.3,:0.4):0.5</try>
              <fail/>
            </name>
            <success>:0.2,(:0.3,:0.4):0.5</success>
          </leaf>
          <success>:0.2,(:0.3,:0.4):0.5</success>
        </subtree>
        <length>
          <try>:0.2,(:0.3,:0.4):0.5</try>
          <success>,(:0.3,:0.4):0.5):0.</success>
        </length>
        <success>,(:0.3,:0.4):0.5):0.</success>
      </branch>
      <branch>
        <try>(:0.3,:0.4):0.5):0.0</try>
        <subtree>
          <try>(:0.3,:0.4):0.5):0.0</try>
          <internal>
            <try>(:0.3,:0.4):0.5):0.0</try>
            <branch>
              <try>:0.3,:0.4):0.5):0.0;</try>
              <subtree>
                <try>:0.3,:0.4):0.5):0.0;</try>
                <internal>
                  <try>:0.3,:0.4):0.5):0.0;</try>
                  <fail/>
                </internal>
                <leaf>
                  <try>:0.3,:0.4):0.5):0.0;</try>
                  <name>
                    <try>:0.3,:0.4):0.5):0.0;</try>
                    <fail/>
                  </name>
                  <success>:0.3,:0.4):0.5):0.0;</success>
                </leaf>
                <success>:0.3,:0.4):0.5):0.0;</success>
              </subtree>
              <length>
                <try>:0.3,:0.4):0.5):0.0;</try>
                <success>,:0.4):0.5):0.0;</success>
              </length>
              <success>,:0.4):0.5):0.0;</success>
            </branch>
            <branch>
              <try>:0.4):0.5):0.0;</try>
              <subtree>
                <try>:0.4):0.5):0.0;</try>
                <internal>
                  <try>:0.4):0.5):0.0;</try>
                  <fail/>
                </internal>
                <leaf>
                  <try>:0.4):0.5):0.0;</try>
                  <name>
                    <try>:0.4):0.5):0.0;</try>
                    <fail/>
                  </name>
                  <success>:0.4):0.5):0.0;</success>
                </leaf>
                <success>:0.4):0.5):0.0;</success>
              </subtree>
              <length>
                <try>:0.4):0.5):0.0;</try>
                <success>):0.5):0.0;</success>
              </length>
              <success>):0.5):0.0;</success>
            </branch>
            <name>
              <try>:0.5):0.0;</try>
              <fail/>
            </name>
            <success>:0.5):0.0;</success>
          </internal>
          <success>:0.5):0.0;</success>
        </subtree>
        <length>
          <try>:0.5):0.0;</try>
          <success>):0.0;</success>
        </length>
        <success>):0.0;</success>
      </branch>
      <name>
        <try>:0.0;</try>
        <fail/>
      </name>
      <success>:0.0;</success>
    </internal>
    <success>:0.0;</success>
  </subtree>
  <fail/>
</tree>
"(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;" -> false false

Looking at the end clearly indicates the problem is that the length (":0.0") is encountered outside the last parentheses, where it is not expected. Perhaps you forgot that you were using tree as the rule, not branch? Anyways, you can probably take it from here.

Side Notes

You are using a skipper which will probably make your life unless you make some rules lexeme (like name). I'd also suggest coding the skipper into your grammar:

auto tree = x3::skip(x3::space) [ subtree >> ';' ];

Note that space includes newlines, so maybe you really want blank instead. Finally, you can incorporate the f == l iterator check into the grammar by appending >> eoi:

auto tree = x3::skip(x3::space) [ subtree >> ';' >> x3::eoi ];

Full Listing

Also addressing the side notes and removing the debug/expositional stuff:

Live On Coliru

#include <boost/spirit/home/x3.hpp>
#include <iomanip>
#include <iostream>
namespace x3 = boost::spirit::x3;

namespace quetzal::newick::parser {

    x3::rule<struct branch> branch{"branch"};

    auto name     = x3::lexeme[x3::alpha >> *x3::alnum]; // to be improved later
    auto length   = ':' >> x3::double_;
    auto leaf     = -name;
    auto internal = '(' >> (branch % ',') >> ')' >> -name;
    auto subtree  = internal | leaf;
    auto tree     = x3::skip(x3::blank)[subtree >> ';' >> x3::eoi];

    auto branch_def = subtree >> -length;
    BOOST_SPIRIT_DEFINE(branch)
} // namespace quetzal::newick::parser
  
namespace quetzal::newick::test {
    void run_tests(auto name, auto p, std::initializer_list<char const*> cases) {
        std::cerr << "============ running " << name << " tests:\n";
        for (std::string const input : cases)
            std::cout << quoted(input) << " \t-> " << std::boolalpha
                      << parse(begin(input), end(input), p) << std::endl;
    }

    void internal() {
        run_tests("internal", quetzal::newick::parser::internal,
            {
                "(,)",
                "(A,B)F",
                "(A:10,B:10)F",
            });
    }

    void tree() {
        run_tests("tree", quetzal::newick::parser::tree,
            {
                ";",
                "(,);",
                "(,,(,));",
                "(A,B,(C,D));",
                "(A,B,(C,D)E)F;",
                "(:0.1,:0.2,(:0.3,:0.4):0.5);",
                "(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;",
                "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);",
                "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;",
                "((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;",
            });
    }
} // namespace quetzal::newick::test

int main() {
    using namespace quetzal::newick::test;
    internal();
    tree();
}

Prints

============ running internal tests:
"(,)"   -> true
"(A,B)F"    -> true
"(A:10,B:10)F"  -> true
============ running tree tests:
";"     -> true
"(,);"  -> true
"(,,(,));"  -> true
"(A,B,(C,D));"  -> true
"(A,B,(C,D)E)F;"    -> true
"(:0.1,:0.2,(:0.3,:0.4):0.5);"  -> true
"(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;"  -> false
"(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);"  -> true
"(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"    -> true
"((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;"   -> true
sehe
  • 374,641
  • 47
  • 450
  • 633
  • 1
    Added a [full version](http://coliru.stacked-crooked.com/a/e81675bf38de1d7f) addressing the side notes and cleaning up the expositional stuff – sehe Nov 15 '22 at 22:24
  • woh woh wooooh! I have no words, but a big thank you! So the root of my problem was the optional, it makes so much sense now. From what I see in your code, I should probably stop using `using x3::lexeme;` directives. Thanks for the free debug tips, it's precious advise! – WaterFox Nov 15 '22 at 22:37
  • 1
    I'm not religious about `using` directives, but for my own code I'll usually use `using namespace x3;` very locally. For SO answers I've done that, but I've started using `x3::` qualification more because I find it's not harming readability and it makes it a lot clearer for new X3 users. In fact, it gives incentive to keep rules short :) – sehe Nov 15 '22 at 22:40