Parsing text including Unicode using spirit crashes with boost-1.78, but OK with boost-1.67, Why?

Question

I wrote the following code which crashes with boost-1.78; While, I replace std::string input = "geo_dip_subdivision:(+国 -民)"; with std::string input = "geo_dip_subdivision:(+1 -2)";, it runs as expected.

Also, it runs as expected with boost-1.67 and std::string input = "geo_dip_subdivision:(+国 -民)";

So, it is a problem related to Unicode. But I don't know what is the problem, and why it seems running as expected in boost-1.67.

Any help?

#include <string.h>
#define BOOST_SPIRIT_UNICODE
#include <boost/phoenix.hpp>
#include <boost/phoenix/operator.hpp>
#include <boost/spirit/include/qi.hpp>


namespace DB {
using std::vector;
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
#define str_pattern (('"' > *(qi::unicode::char_ - ('"')) | "\\\"") > '"')
#define sym_open    (char_('[') | char_('{'))
#define sym_close   (char_(']') | char_('}'))


struct query_tree;
typedef boost::variant<std::string, query_tree> node;

struct query_tree {
    vector<node> must;
    vector<node> must_not;
    vector<node> should;

    query_tree() = default;

    query_tree(int type, query_tree& old, const query_tree& v)
    {
        if (type == 3) {
            assert(old.should.size() == 0 || old.must.size() == 0);
            assert(v.should.size() + v.must.size() + v.must_not.size() == 1);
            if (old.should.size() > 0) {
                must = std::move(old.should);
            } else {
                must = std::move(old.must);
            }
            must_not = std::move(old.must_not);

            if (v.should.size() > 0) {
                must.push_back(v.should[0]);
            } else if (v.must.size() > 0) {
                must.push_back(v.must[0]);
            } else {
                must_not.push_back(v.must_not[0]);
            }
        } else {
            must = std::move(old.must);
            must_not = std::move(old.must_not);
            should = std::move(old.should);
            push_back(type, v);
        }
    }

    query_tree(int type, const std::string& n) { push_back(type, n); }

    template <typename T> void push_back(int type, const T& v)
    {
        if (type == 0) {
            must.push_back(v);
        } else if (type == 1) {
            must_not.push_back(v);
        } else {
            should.push_back(v);
        }
    }

    query_tree(query_tree& old, const query_tree& v)
    {
        must = std::move(old.must);
        for (size_t i = 0; i < v.must.size(); i++) {
            must.push_back(v.must[i]);
        }

        must_not = std::move(old.must_not);
        for (size_t i = 0; i < v.must_not.size(); i++) {
            must_not.push_back(v.must_not[i]);
        }

        should = std::move(old.should);
        for (size_t i = 0; i < v.should.size(); i++) {
            should.push_back(v.should[i]);
        }
    }
};

template <typename It, typename Skipper = qi::space_type>
struct parser : qi::grammar<It, query_tree(), Skipper> {
    parser() : parser::base_type(query)
    {
        using namespace qi;

        part1 = raw[lexeme[*(str_pattern | qi::unicode::char_ - (char_(')') | char_('(')))]];
        part2 = part1[_val = _1] > *(parenthese[_val = _val + _1]) >
                (char_(')')[_val = _val + _1] | part2[_val = _val + _1]);
        parenthese = char_('(')[_val = _1] > part2[_val = _val + _1];
        range = raw[lexeme[sym_open > *(char_ - sym_close) > sym_close]];

        name = raw[lexeme[+(qi::unicode::char_ - (':' | space | ')')) > ':']];
        other_value = raw[lexeme[+(qi::unicode::char_ - space - ')')]];
        string_value = raw[lexeme[str_pattern]];
        field =
            name[_val = _1] > (string_value | parenthese | range | other_value)[_val = _val + _1];

        group = '(' > query > ')';
        must = "+" > (group[_val = _1] | field[_val = phx::construct<query_tree>(0, _1)]);
        must_not = (string("-") | string("NOT")) >
                   (group[_val = _1] | field[_val = phx::construct<query_tree>(0, _1)]);
        should = group[_val = _1] | field[_val = phx::construct<query_tree>(2, _1)];

        expr = (must[_val = phx::construct<query_tree>(0, _val, _1)] |
                must_not[_val = phx::construct<query_tree>(1, _val, _1)] |
                should[_val = phx::construct<query_tree>(2, _val, _1)]);

        And = expr[_val = phx::construct<query_tree>(_val, _1)] >
              *((string("AND") | string("&&")) >
                expr[_val = phx::construct<query_tree>(3, _val, _1)]);
        Or = And[_val = _1] >
             *((string("OR") | string("||")) > And[_val = phx::construct<query_tree>(_val, _1)]);

        query = *(Or[_val = phx::construct<query_tree>(_val, _1)]);
    }

private:
    qi::rule<It, std::string(), Skipper> field, name, string_value, other_value;
    qi::rule<It, std::string(), qi::no_skip_type> parenthese, part1, part2, range;
    qi::rule<It, query_tree(), Skipper> must, must_not, should, query, expr, group, And, Or;
};


std::string parse_from_lucene(std::string& input)
{
    auto f(std::begin(input)), l(std::end(input));
    parser<decltype(f)> p;

    std::string str;
    try {
        query_tree result;
        bool ok = qi::phrase_parse(f, l, p, qi::space, result);
        if (!ok) {
            throw "invalid input: " + input;
        }
    } catch (const qi::expectation_failure<decltype(f)>& e) {
        throw "expectation_failure at '" + std::string(e.first, e.last) + "'\n";
    }
    return str;
}

};


int main()
{
    std::string input = "geo_dip_subdivision:(+国 -民)";
    input = DB::parse_from_lucene(input);
    std::cout << input << std::endl;
    return 0;
}

sehe · Answer 1 · 2022-09-13T07:07:31.133

Mmm. There are many iffy things about the grammar.

over-use of semantic actions (Boost Spirit: "Semantic actions are evil"?)
using qi::no_skip_type as a... skipper?
mixing char_/lit/raw
mixing primitives from unicode and standard encoding
throwing literals
the str_pattern macro had a bug (see fixed below)
the logic to parse open/close braces seems buggy since it doesn't check matching pairs
parsing into a query_tree, but returning a default-constructed str - the output will always be empty?
note: you don't check that the full input is parsed; partial parses may lead to surprising errors
note: some logic in the type==3 constructor seems buggy (should becomes must?)

If you gave me a reference to the grammar documentation and a list of examples that you expect to parse, I think I'd be able to simplify this to about half the code while removing these errors.

Here's an initial set of refactorings I see that were necessary:

Live On Coliru

#define BOOST_SPIRIT_UNICODE
#include <string.h>
#include <boost/phoenix.hpp>
#include <boost/phoenix/operator.hpp>
#include <boost/spirit/include/qi.hpp>


namespace DB {
namespace qi       = boost::spirit::qi;
namespace phx      = boost::phoenix;
namespace encoding = qi::unicode;
#define str_pattern ('"' > *("\\\"" | encoding::char_ - ('"')) > '"') // BUG
#define sym_open (encoding::char_("[{"))                              // BUG
#define sym_close (encoding::char_("]}"))                             // BUG

struct query_tree;
using String = std::u32string;
using node   = boost::variant<String, query_tree>;

struct query_tree {
    std::vector<node> must;
    std::vector<node> must_not;
    std::vector<node> should;

    query_tree() = default;

    query_tree(int type, query_tree& old, const query_tree& v) {
        if (type == 3) {
            assert(old.should.size() == 0 || old.must.size() == 0);
            assert(v.should.size() + v.must.size() + v.must_not.size() == 1);
            if (old.should.size() > 0) {
                must = std::move(old.should); // PROBABLY BUG?
            } else {
                must = std::move(old.must);
            }
            must_not = std::move(old.must_not);

            if (v.should.size() > 0) {
                must.push_back(v.should[0]); // PROBABLY BUG?
            } else if (v.must.size() > 0) {
                must.push_back(v.must[0]);
            } else {
                must_not.push_back(v.must_not[0]);
            }
        } else {
            must     = std::move(old.must);
            must_not = std::move(old.must_not);
            should   = std::move(old.should);
            push_back(type, v);
        }
    }

    query_tree(int type, const String& n) { push_back(type, n); }

    template <typename T> void push_back(int type, const T& v) {
        if (type == 0) {
            must.push_back(v);
        } else if (type == 1) {
            must_not.push_back(v);
        } else {
            should.push_back(v);
        }
    }

    query_tree(query_tree& old, const query_tree& v) {
        must = std::move(old.must);
        for (size_t i = 0; i < v.must.size(); i++) {
            must.push_back(v.must[i]);
        }

        must_not = std::move(old.must_not);
        for (size_t i = 0; i < v.must_not.size(); i++) {
            must_not.push_back(v.must_not[i]);
        }

        should = std::move(old.should);
        for (size_t i = 0; i < v.should.size(); i++) {
            should.push_back(v.should[i]);
        }
    }
};

template <typename It> struct parser : qi::grammar<It, query_tree()> {
    parser() : parser::base_type(start) {
        using encoding::char_;
        using encoding::space;
        using encoding::string;
        using qi::lexeme;
        using qi::raw;
        using namespace qi::labels;

        auto SET = boost::proto::deep_copy(_val = _1);
        auto ADD = boost::proto::deep_copy(_val = _val + _1);
#define TREE(...) _val = phx::construct<query_tree>(__VA_ARGS__)

        part1 = raw[*(str_pattern | char_ - char_("()"))];
        part2 = part1[SET] > *(parenthese[ADD]) >
            (char_(')')[ADD] | part2[ADD]);
        parenthese = char_('(')[SET] > part2[ADD];

        range = raw[sym_open > *(char_ - sym_close) > sym_close];

        name         = raw[+(encoding::char_ - (':' | space | ')')) > ':'];
        other_value  = raw[+(encoding::char_ - space - ')')];
        string_value = raw[str_pattern];
        field        = name[SET] >
            (string_value | parenthese | range | other_value)[ADD];

        group = '(' > query > ')';
        must     = "+" > (group[SET] |
                      field[_val = phx::construct<query_tree>(0, _1)]);
        must_not = (string("-") | string("NOT")) >
            (group[SET] |
             field[_val = phx::construct<query_tree>(0, _1)]);
        should =
            group[SET] | field[_val = phx::construct<query_tree>(2, _1)];

        expr = (must[_val = phx::construct<query_tree>(0, _val, _1)] |
                must_not[_val = phx::construct<query_tree>(1, _val, _1)] |
                should[_val = phx::construct<query_tree>(2, _val, _1)]);

        And = expr[_val = phx::construct<query_tree>(_val, _1)] >
            *((string("AND") | string("&&")) >
              expr[_val = phx::construct<query_tree>(3, _val, _1)]);
        Or = And[SET] >
            *((string("OR") | string("||")) >
              And[_val = phx::construct<query_tree>(_val, _1)]);

        query = *(Or[_val = phx::construct<query_tree>(_val, _1)]);
        start = qi::skip(encoding::space) [ query > qi::eoi ];
    }

  private:
    qi::rule<It, query_tree()> start;
    //
    qi::rule<It, query_tree(), encoding::space_type>  must, must_not, should,
        query, expr, group, And, Or;
    qi::rule<It, String(), encoding::space_type> field;
    // lexemes:
    qi::rule<It, String()> parenthese, part1, part2, range, name,
        string_value, other_value;
};

String parse_from_lucene(String const& input) {
    auto f(std::begin(input)), l(std::end(input));
    parser<decltype(f)> static const p{};

    String str;
    try {
        query_tree result;
        qi::parse(f, l, qi::eps > p, result);
    } catch (const qi::expectation_failure<decltype(f)>& e) {
        throw std::runtime_error("expectation_failure at '" +
                                 std::string(e.first, e.last) + "'\n");
    }
    return str;
}
} // namespace DB

#include <codecvt>
#include <locale>

int main()
{
    DB::String input = U"geo_dip_subdivision:(+国 -民)";
    input = DB::parse_from_lucene(input);
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;

    std::cout << converter.to_bytes(input) << std::endl;
}

With no output, as expected.

Oh, yes, str_pattern has a bug in it. What surprise me it I change the definition to `#define str_pattern ('"' > *(qi::unicode::char_ - ('"') | "\\\"") > '"')`, switching `left | right` to `right | left`, a parsing error is reported, open/close is OK becaouse [1, 2} means half-close half-open range [1, 2) in math. `type==3` has no problem, grammar in lucene the expression `should && (should | must)` means `must && must`; the empty `str` is because I deleted some code to make the code simpler. — user2256177, Sep 14 '22 at 07:59
Can't qi::no_skip_type use as a... skipper? I just want to keep everything untouched in parentheses, and found it seems works as I expect — user2256177, Sep 14 '22 at 08:03
No. `no_skip_type` is the type of the [`qi::no_skip[]` directive](https://www.boost.org/doc/libs/1_80_0/libs/spirit/doc/html/spirit/qi/reference/directive/no_skip.html). Without anything it would be invalid parser expression: https://godbolt.org/z/5E13a3YeW. The `rule<>` is sadly liberal in what it accepts. See here for how to do it correctly https://stackoverflow.com/a/17073965/85371 (also what I demonstrated in my code, see the `lexemes` comment) — sehe, Sep 14 '22 at 11:10
Good that most of the surprising things I noticed work out well for your application.I'd still be very interested in _"a reference to the grammar documentation and a list of examples that you expect to parse"_ because I'd love to show a semantic-action less, much shorter version that's easier to maintain. If you want I can do it in X3 as well — sehe, Sep 14 '22 at 11:13
Some of the lucene query sytax can be found in https://lucene.apache.org/core/2_9_4/queryparsersyntax.html. Seems it does not includes everything. For me: 1. + means must, 2. - means must_not, 3 no prefix means should 4. space between must, must_not means and, between should means or, 5: parentheses means grouping, 6. [1, 2] = [1, 2]; [1, 2} => [1, 2); {1, 2] => (1, 2]; {1, 2} => (1, 2) in math. — user2256177, Sep 14 '22 at 11:34
examples: 1. +a:b +a:c means (a:b and a:c) 2. +a:b a:c means a:b 3. +a:b +(a:c a:d) means (a:b and (a:c or a:d)) 4 +a:b +(+a:c a:d) means (a:b and a:c) 5. a:[1, 4) means (a:1 or a:2 or a:3) 5. a:b -a:c means (a:b and not a:c) 6. a:b and a:c means +a:b +a:c, e.g. (a:b and a:c) 7. a:b a:c means (a:b or a:c) — user2256177, Sep 14 '22 at 11:34
Would you explain why `#define str_pattern ('"' > *(qi::unicode::char_ - ('"') | "\\\"") > '"')` does not work and `#define str_pattern ('"' > *("\\\"" | qi::unicode::char_ - ('"')) > '"') `works? — user2256177, Sep 14 '22 at 11:39
https://lucene.apache.org/core/9_3_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description, syntax updated document — user2256177, Sep 14 '22 at 11:44
Re. str_pattern: PEG grammars consume input left-to-right greedy — sehe, Sep 14 '22 at 12:14

Parsing text including Unicode using spirit crashes with boost-1.78, but OK with boost-1.67, Why?

1 Answers1