4

I have a huge amount of files I am trying to parse using boost::spirit::qi. Parsing is not a problem, but some of the files contain noise that I want to skip. Building a simple parser (not using boost::spirit::qi) verifies that I can avoid the noise by skipping anything that doesn't match rules at the beginning of a line. So, I'm looking for a way to write a line based parser that skip lines when not matching any rule.

The example below allows the grammar to skip lines if they don't match at all, but the 'junk' rule still inserts an empty instance of V(), which is unwanted behaviour. The use of \r instead of \n in the example is intentional as I have encountered both \n, \r and \r\n in the files.

#include <iostream>
#include <string>
#include <vector>
#include <boost/foreach.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/include/std_tuple.hpp>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
namespace phx = boost::phoenix;

using V = std::tuple<std::string, double, double, double>;

namespace client {
    template <typename Iterator>
    struct VGrammar : qi::grammar<Iterator, std::vector<V>(), ascii::space_type> {
        VGrammar() : VGrammar::base_type(start) {
            using namespace qi;

            v %= string("v") > double_ > double_ > double_;
            junk = +(char_ - eol);
            start %= +(v | junk);

            v.name("v");
            junk.name("junk");
            start.name("start");

            using phx::val;
            using phx::construct;

            on_error<fail>(
                start,
                std::cout
                    << val("Error! Expecting \n\n'")
                    << qi::_4
                    << val("'\n\n here: \n\n'")
                    << construct<std::string>(qi::_3, qi::_2)
                    << val("'")
                    << std::endl
            );

            //debug(v);
            //debug(junk);
            //debug(start);
        }

        qi::rule<Iterator> junk;
        //qi::rule<Iterator, qi::unused_type()> junk; // Doesn't work either
        //qi::rule<Iterator, qi::unused_type(), qi::unused_type()> junk; // Doesn't work either
        qi::rule<Iterator, V(), ascii::space_type> v;
        qi::rule<Iterator, std::vector<V>(), ascii::space_type> start;
    };
} // namespace client

int main(int argc, char* argv[]) {
    using iterator_type = std::string::const_iterator;

    std::string input = "";
    input += "v 1 2 3\r";         // keep v 1 2 3
    input += "o a b c\r";         // parse as junk
    input += "v 4 5 6 v 7 8 9\r"; // keep v 4 5 6, but parse v 7 8 9 as junk
    input += "   v 10 11 12\r\r"; // parse as junk

    iterator_type iter = input.begin();
    const iterator_type end = input.end();
    std::vector<V> parsed_output;
    client::VGrammar<iterator_type> v_grammar;

    std::cout << "run" << std::endl;
    bool r = phrase_parse(iter, end, v_grammar, ascii::space, parsed_output);
    std::cout << "done ... r: " << (r ? "true" : "false") << ", iter==end: " << ((iter == end) ? "true" : "false") << std::endl;

    if (r && (iter == end)) {
        BOOST_FOREACH(V const& v_row, parsed_output) {
            std::cout << std::get<0>(v_row) << ", " << std::get<1>(v_row) << ", " << std::get<2>(v_row) << ", " << std::get<3>(v_row) << std::endl;
        }
    }

    return EXIT_SUCCESS;
}

Here's the output from the example:

run
done ... r: true, iter==end: true
v, 1, 2, 3
, 0, 0, 0
v, 4, 5, 6
v, 7, 8, 9
v, 10, 11, 12

And here is what I actually want the parser to return.

run
done ... r: true, iter==end: true
v, 1, 2, 3
v, 4, 5, 6

My main problem right now is to keep the 'junk' rule from adding an empty V() object. How do I accomplish this? Or am I overthinking the problem?

I have tried adding lit(junk) to the start rule, since lit() doesn't return anything, but this will not compile. It fails with: "static assertion failed: error_invalid_expression".

I have also tried to set the semantic action on the junk rule to qi::unused_type() but the rule still creates an empty V() in that case.

I am aware of the following questions, but they don't address this particular issue. I have tried out the comment skipper earlier, but it looks like I'll have to reimplement all the parse rules in the skipper in order to identify noise. My example is inspired by the solution in the last link:

How to skip line/block/nested-block comments in Boost.Spirit?

How to parse entries followed by semicolon or newline (boost::spirit)?

Version info:

Linux debian 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
g++ (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
#define BOOST_VERSION 106200

and:

Linux raspberrypi 4.14.24-v7+ #1097 SMP Mon Mar 5 16:42:05 GMT 2018 armv7l GNU/Linux
g++ (Raspbian 4.9.2-10+deb8u1) 4.9.2
#define BOOST_VERSION 106200 

For those who wonder: yes I'm trying to parse files similar to Wavefront OBJ files and I'm aware that there is already a bunch of parsers available. However, the data I'm parsing is part of a larger data structure which also requires parsing, so it does make sense to build a new parser.

Yumi Onei
  • 43
  • 4
  • have a look here https://stackoverflow.com/questions/49262634/slow-performance-using-boost-xpressive/49312362#49312362 where I remember talking about skipping uninteresting lines – sehe Aug 24 '18 at 20:48
  • Also relevant: https://stackoverflow.com/questions/17661026/parsing-into-several-vector-members/17664148#17664148 (wavefront Qi parser), and specifically about ignoring comments in Wavefront OBJ: https://stackoverflow.com/questions/20842508/obj-parser-with-boost-spirit-ignoring-comments/20844486#20844486 – sehe Aug 24 '18 at 20:57
  • And there was that time where I implemented the Material parsing from specs: https://stackoverflow.com/questions/20443391/its-a-good-idea-to-use-boostprogram-options-to-parse-a-text-file/20448756#20448756 – sehe Aug 24 '18 at 21:01

2 Answers2

3

What you are wanting to achieve is called error recover.

Unfortunately, Spirit does not have a nice way of doing it (there are also some internal decisions which makes it hard to make it externally). However, in your case it is simple to achieve by grammar rewrite.

#include <iostream>
#include <string>
#include <vector>
#include <boost/foreach.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/include/std_tuple.hpp>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
namespace phx = boost::phoenix;

using V = std::tuple<std::string, double, double, double>;

namespace client {
    template <typename Iterator>
    struct VGrammar : qi::grammar<Iterator, std::vector<V>()> {
        VGrammar() : VGrammar::base_type(start) {
            using namespace qi;

            v = skip(blank)[no_skip[string("v")] > double_ > double_ > double_];
            junk = +(char_ - eol);
            start = (v || -junk) % eol;

            v.name("v");
            junk.name("junk");
            start.name("start");

            using phx::val;
            using phx::construct;

            on_error<fail>(
                start,
                std::cout
                << val("Error! Expecting \n\n'")
                << qi::_4
                << val("'\n\n here: \n\n'")
                << construct<std::string>(qi::_3, qi::_2)
                << val("'")
                << std::endl
                );

            //debug(v);
            //debug(junk);
            //debug(start);
        }

        qi::rule<Iterator> junk;
        //qi::rule<Iterator, qi::unused_type()> junk; // Doesn't work either
        //qi::rule<Iterator, qi::unused_type(), qi::unused_type()> junk; // Doesn't work either
        qi::rule<Iterator, V()> v;
        qi::rule<Iterator, std::vector<V>()> start;
    };
} // namespace client

int main(int argc, char* argv[]) {
    using iterator_type = std::string::const_iterator;

    std::string input = "";
    input += "v 1 2 3\r";         // keep v 1 2 3
    input += "o a b c\r";         // parse as junk
    input += "v 4 5 6 v 7 8 9\r"; // keep v 4 5 6, but parse v 7 8 9 as junk
    input += "   v 10 11 12\r\r"; // parse as junk

    iterator_type iter = input.begin();
    const iterator_type end = input.end();
    std::vector<V> parsed_output;
    client::VGrammar<iterator_type> v_grammar;

    std::cout << "run" << std::endl;
    bool r = parse(iter, end, v_grammar, parsed_output);
    std::cout << "done ... r: " << (r ? "true" : "false") << ", iter==end: " << ((iter == end) ? "true" : "false") << std::endl;

    if (r && (iter == end)) {
        BOOST_FOREACH(V const& v_row, parsed_output) {
            std::cout << std::get<0>(v_row) << ", " << std::get<1>(v_row) << ", " << std::get<2>(v_row) << ", " << std::get<3>(v_row) << std::endl;
        }
    }

    return EXIT_SUCCESS;
}
Nikita Kniazev
  • 3,728
  • 2
  • 16
  • 30
  • I think OP wants to avoid semantic actions. Also, can you clarify in which way you think OP is trying to do "error recovery"? – sehe Aug 25 '18 at 23:25
  • 1
    I think you've misunderstood the topic name. Moreover, since it says "I have a huge amount of files I am trying to parse" it is better to avoid unnecessary backtracking. What about error recovery, skipping "junk" in parsing is called error recovering, I do not understand what type of clarification you are asking. – Nikita Kniazev Aug 26 '18 at 12:57
  • Fair point about backtracking. If the performance is an issue, the complexity of using semantic actions/customized attribute propagation might be warranted. +1 then :) – sehe Aug 26 '18 at 13:29
  • I didn't think of "skipping junk" as error recovering (just as a sloppy grammar). But I understand now what you meant (if you consider the grammar "complete" and any non-conformant input as "deficient" then indeed skipping junk can be seen as error recovery.) Other parser generators have "resync points" or similar for tasks like that. – sehe Aug 26 '18 at 13:31
  • Oops. I was misguided by `x3` knowledge, `qi` is smarter and it does not append default constructed elements if underlying parser does not generate a value. So grammar is working without semantic actions https://wandbox.org/permlink/N3wYV4UEqAHnG5IN – Nikita Kniazev Aug 26 '18 at 21:20
  • Geez. I totally forgot about `operator||` as well. Thanks for sticking with it! – sehe Aug 26 '18 at 21:36
  • 1
    It looks like sequential-or is barely used by anyone. It contained UB all the time until Boost 1.67 https://github.com/boostorg/spirit/pull/310 – Nikita Kniazev Aug 26 '18 at 22:05
  • Anyhoops, X3 works completely analogous, AFAICT here (of course not using the sequential-or): http://coliru.stacked-crooked.com/a/1166ce3dec849891 – sehe Aug 26 '18 at 22:06
  • Indeed, interesting. However a simple case gives this https://wandbox.org/permlink/sk31FWgMeJ9UlKhK >_ – Nikita Kniazev Aug 26 '18 at 22:26
  • Haha. We actively trying to break things now are we :) Still the same in Qi . The critical difference appears to be at what time the `unused_type` is "boiled off". If it happens late, we get the empties. If it happens first, the Kleene element type still matches the `optional` pattern so conditional push_back can take place. I've developed a pretty good sense for just how much the attribute propagation rules will "bend" over years of experimenting and transpiration. I'm okay with that. Let's use Rust or Haskell if we need better language support. ANTLR or similar for pro parsers in C++ – sehe Aug 26 '18 at 22:41
  • I have opened an [issue #394](https://github.com/boostorg/spirit/issues/394). If you have other repros, please add them. The bugs should be fixed, or at least reported. – Nikita Kniazev Aug 26 '18 at 23:03
  • I honestly never considered these bugs (or a library like Spirit would probably never be suited for production). If any library offers features based on convention or heuristics, I go with the flow, adapt to the library instead of vice versa. In that respect I consider my self more like a crafts man than a mathematician. Meanwhile I think I've been one of Spirits biggest/must active supporters so I'm happy to leave the other roles to others. Thanks for stepping up – sehe Aug 26 '18 at 23:10
  • Inconsistency obviously is a bug and these thing actually prevents production usage of a library. In case of Spirit, there are a lot of caveats which actually bugs, they quickly enough become PITA and turn away newcomers. – Nikita Kniazev Aug 26 '18 at 23:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/178830/discussion-between-sehe-and-nikita-kniazev). – sehe Aug 27 '18 at 07:00
  • Yes, this is kind of an attempt at error recovery. Some of the files I have are edited or created with at tool that forgets to add 0x0 at the end of char[], so real data is followed by a dump of random memory, sometimes containing tokens recognized by the parser, but not at the beginning of a line. There is no meaningful way to identify this except that it doesn't match a known combination of tokens. – Yumi Onei Aug 28 '18 at 04:04
  • As I mentioned for sehe's solution, I accept your solution since I only have to add new rules to the start rule. Otherwise, both solutions work just fine. Thank you for your help. – Yumi Onei Aug 28 '18 at 07:36
3

I have tried adding lit(junk) to the start rule, since lit() doesn't return anything, but this will not compile. It fails with: "static assertion failed: error_invalid_expression".

What you're looking for would be omit[junk], but it should make no difference because it will still make the synthesized attribute optional<>.

Fixing Things

  1. First of all, you need newlines to be significant. Which means you cannot skip space. Because it eats newlines. What's worse, you need leading whitespace to be significant as well (to junk that last line, e.g.). You cannot even use qi::blank for the skipper then. (See Boost spirit skipper issues).

    Just so you can still have whitespace inside the v rule, just have a local skipper (that doesn't eat newlines):

    v %= &lit("v") >> skip(blank) [ string("v") > double_ > double_ > double_ ];
    

    It engages the skipper only after establishing that there was no unexpected leading whitespace.

    Note that the string("v") is a bit redundant this way, but that brings us to the second motive:

  2. Second of all, I'm with you in avoiding semantic actions. However, this means you have to make your rules reflect your data structures.

    In this particular instance, it means you should probably turn the line skipping a bit inside-out. What if you express the grammar as a straight repeat of v, interspersed with /whatever/, instead of just /newline/? I'd write that like:

    junk = *(char_ - eol);
    other = !v >> junk;
    
    start = *(v >> junk >> eol % other);
    

    Note that

    • the delimiter expression now uses the operator% (list operator) itself: (eol % other). What this cleverly accomplishes is that it keeps eating newlines as long as they are only delimited by "other" lines (anything !v at this point).
    • other is more constrained than junk, because junk may eat v, whereas other makes sure that never happens
    • therefore v >> junk allows the third line of your sample to be correctly processed (the line that has v 4 5 6 v 7 8 9\r)

Now it all works: Live On Coliru:

run
done ... r: true, iter==end: true
v, 1, 2, 3
v, 4, 5, 6

Perfecting It

You might be aware of the fact that this does not handle the case when the first line(s) are not v lines. Let's add that case to the sample and make sure it works as well:

Live On Coliru:

//#define BOOST_SPIRIT_DEBUG
#include <iostream>
#include <string>
#include <vector>
#include <boost/foreach.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/include/std_tuple.hpp>

namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;

using V = std::tuple<std::string, double, double, double>;

namespace client {
    template <typename Iterator>
    struct VGrammar : qi::grammar<Iterator, std::vector<V>()> {
        VGrammar() : VGrammar::base_type(start) {
            using namespace qi;

            v %= &lit("v") >> skip(blank) [ string("v") > double_ > double_ > double_ ];
            junk = *(char_ - eol);
            other = !v >> junk;

            start = 
                other >> eol % other >>
                *(v >> junk >> eol % other);

            BOOST_SPIRIT_DEBUG_NODES((v)(junk)(start))

            on_error<fail>(
                start,
                std::cout
                    << phx::val("Error! Expecting \n\n'") << qi::_4
                    << "'\n\n here: \n\n'" << phx::construct<std::string>(qi::_3, qi::_2)
                    << "'\n"
            );
        }

      private:
        qi::rule<Iterator> other, junk;
        qi::rule<Iterator, V()> v;
        qi::rule<Iterator, std::vector<V>()> start;
    };
} // namespace client

int main() {
    using iterator_type = std::string::const_iterator;

    std::string input = "";
    input += "o a b c\r";         // parse as junk
    input += "v 1 2 3\r";         // keep v 1 2 3
    input += "o a b c\r";         // parse as junk
    input += "v 4 5 6 v 7 8 9\r"; // keep v 4 5 6, but parse v 7 8 9 as junk
    input += "   v 10 11 12\r\r"; // parse as junk

    iterator_type iter = input.begin();
    const iterator_type end = input.end();
    std::vector<V> parsed_output;
    client::VGrammar<iterator_type> v_grammar;

    std::cout << "run" << std::endl;
    bool r = parse(iter, end, v_grammar, parsed_output);
    std::cout << "done ... r: " << (r ? "true" : "false") << ", iter==end: " << ((iter == end) ? "true" : "false") << std::endl;

    if (iter != end)
        std::cout << "Remaining unparsed: '" << std::string(iter, end) << "'\n";

    if (r) {
        BOOST_FOREACH(V const& v_row, parsed_output) {
            std::cout << std::get<0>(v_row) << ", " << std::get<1>(v_row) << ", " << std::get<2>(v_row) << ", " << std::get<3>(v_row) << std::endl;
        }
    }

    return EXIT_SUCCESS;
}
sehe
  • 374,641
  • 47
  • 450
  • 633
  • There is no difference between `junk` and `omit[junk]`, `junk` generates attribute of `unused_type`. The actual problem asked in the topic is "why the Spirit appends default constructed attribute to a container instead of doing nothing when underlying parser attribute is of `unused_type`". – Nikita Kniazev Aug 26 '18 at 13:08
  • @NikitaKniazev Thanks for letting me know :) Obviously I was just responding, in the most literal and direct sense, to the (quoted) observation that `lit(junk)` didn't compile. As you can see in the rest of my answer I don't even give it any more mention, because indeed all the different spellings of the `unused_type` attribute declaration are equivalent, and that's indeed not what the question is about. My answer is, incidentally, about what you mentioned: how to formulate the parser in such a way that it _does_ match the desired bound output attribute. – sehe Aug 26 '18 at 13:26
  • You've earned my +1 with this bizarre rule that for some reason Spirit accepts. However I would not suggest anyone to actually have something like this in their codebase :) – Nikita Kniazev Aug 26 '18 at 13:45
  • It's pretty obvious why it is accepted (like I explain it just matches the data structure). Yeah, the "bizarre" part is where the delimiter does negative look-aheads, but that's pretty common (though a lot more common in lexing/scanning I guess) – sehe Aug 26 '18 at 13:57
  • Thank your explanations and for the addtional links. I can see that I have a long way to go in order to grasp the nuances of spirit::qi. However, your example works, but doesn't work with the original data. You added "o a b c" to the beginning of the input in your example. If I remove that line, the parser fails. Of course, most OBJ files have mtllib or something similar before vertex info. – Yumi Onei Aug 28 '18 at 03:51
  • I have two live sample links. First shows correct parsing of the original. The other answerer tik time to arrive at the best solution I suppose. I'd just go with that. Cheers – sehe Aug 28 '18 at 06:32
  • @sehe. If I change "other >> eol % other >>" to "*(other >> eol % other >>)" in the last example, it works fine with the original data. – Yumi Onei Aug 28 '18 at 07:20
  • Both your and Nikita's solutions solve the problem. I'll accept Nikita's solution because it looks slightly cleaner when I start adding additional rules (vt, vn and such). I have tried both solutions by adding vt and vn, returning a std::vector> and they both work as expected. While your solution comes with more additional information, I have to repeat all recognized rules in both start and other rules. I have not tested the speed of either solution. Thank you for all the additional links. – Yumi Onei Aug 28 '18 at 07:31