
I'm trying to write a parser using boost::spirit::qi that parses everything between a pair of " characters as-is, while allowing escaped " characters. For example, "ab\n\"" should return ab\n\". I've tried the following code (godbolt link):

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;

int main() {
    std::string input{R"("ab\n\"")"};
    std::cout << "[" << input << "]\n";

    std::string output;

    using Skipper = qi::rule<std::string::const_iterator>;
    Skipper skip = qi::space;
    qi::rule<std::string::const_iterator, std::string(), Skipper> qstring;

    qstring %= qi::lit("\"") 
        > ( *( (qi::print - qi::lit('"') - qi::lit("\\")) | (qi::char_("\\") > qi::print) ) )
                                                            //   ^^^^^
        > qi::lit("\"");

    auto success = qi::phrase_parse(input.cbegin(), input.cend(), qstring, skip, output);

    if (!success) {
        std::cout << "Failed to parse";
        return 1;
    }
    
    std::cout << "output = [" << output << "]\n";

    return 0;
}

This fails to compile with template errors:

/opt/compiler-explorer/libs/boost_1_81_0/boost/spirit/home/support/container.hpp:130:12: error: 'char' is not a class, struct, or union type
  130 |     struct container_value
      |            ^~~~~~~~~~~~~~~
.....
/opt/compiler-explorer/libs/boost_1_81_0/boost/spirit/home/qi/detail/pass_container.hpp:320:66: error: no type named 'type' in 'struct boost::spirit::traits::container_value<char, void>'
  320 |             typedef typename traits::container_value<Attr>::type value_type;

I can get the code to compile if I change the marked qi::char_("\\") to qi::lit("\\"), but that doesn't create an attribute for the \ that it matches. I've also found that it compiles if I move just the Kleene star into a separate rule, as shown below. Is there a way to get Boost to use the correct types in a single expression?

qi::rule<std::string::const_iterator, std::string(), Skipper> qstring;
qi::rule<std::string::const_iterator, std::string(), Skipper> qstringbody;

qstringbody %= ( *( (qi::print - qi::lit('"') - qi::lit("\\")) | (qi::char_("\\") > qi::print) ) );
qstring %= qi::lit("\"") 
    > qstringbody
    > qi::lit("\"");
Dval
  • Not relevant to your problem, but note that **Expect** `a > b` will throw `expectation_failure` if `b` does not follow `a`, whereas **Sequence** `a >> b` will not. – Eljay Jan 06 '23 at 15:22

1 Answer


qi::char_("\\") with qi::lit("\\"), but that doesn't create an attribute for the \ which it matches

This is what you require. Parsing should translate the input representation (syntax) into a meaningful representation (semantics). It is possible to have an AST that reflects escapes, of course, but then you would NOT be parsing into a string, but into something like

struct char_or_escape {
    enum { hex_escape, octal_escape, C_style_char_esc, unicode_codepoint_escape, named_unicode_escape } type;
    std::variant<uint32_t, std::string> value;
};
using StringAST = std::vector<char_or_escape>;

Presumably, you don't want to keep the raw input (otherwise, qi::raw[] is your friend).

Applying It

Here's my simplification:

qi::rule<It, std::string(), Skipper> qstring //
    = '"' > *(qi::print - '"' - "\\" | "\\" > qi::print) > '"';

Side note: It seems to require printables only. I'll remove that assumption in the following. You can, of course, reintroduce character subsets as you require.

qstring = '"' > *(~qi::char_("\"\\") | '\\' > qi::char_) > '"';

Reordering the branches removes the need to exclude '\\', while being more expressive about intent:

qstring = '"' > *('\\' > qi::char_ | ~qi::char_('"')) > '"';

Now, from the example input I gather that you might require a C-style treatment of escapes. May I suggest:

qi::symbols<char, char> c_esc;
c_esc.add("\\\\", '\\')                                                            //
    ("\\a", '\a')("\\b", '\b')("\\n", '\n')("\\f", '\f')("\\t", '\t')("\\r", '\r') //
    ("\\v", '\v')("\\0", '\0')("\\e", 0x1b)("\\'", '\'')("\\\"", '"')("\\?", 0x3f);

qstring = '"' > *(c_esc | '\\' >> qi::char_ | ~qi::char_('"')) > '"';

(Note that some of these entries are redundant: escapes such as \\, \', \" and \? decode to the very character that follows the backslash, which the '\\' >> qi::char_ branch already produces.)
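To make the behaviour of that rule concrete, here is a hand-written model of it in plain C++ with no Spirit involved. It is only a sketch: decode_body is a hypothetical helper, it works on the text between the quotes, and it assumes the body is well formed (no dangling backslash at the end).

```cpp
#include <map>
#include <string>

// Hypothetical model of the rule's three branches, for illustration only.
std::string decode_body(std::string const& body) {
    // Same mapping as the c_esc symbol table, keyed by the character that
    // follows the backslash.
    static std::map<char, char> const c_esc{
        {'\\', '\\'}, {'a', '\a'}, {'b', '\b'}, {'n', '\n'}, {'f', '\f'},
        {'t', '\t'},  {'r', '\r'}, {'v', '\v'}, {'0', '\0'}, {'e', '\x1b'},
        {'\'', '\''}, {'"', '"'},  {'?', '?'},
    };

    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        if (body[i] == '\\' && i + 1 < body.size()) {
            char const next = body[++i];
            auto const it = c_esc.find(next);
            // Known escape: emit the mapped character (the c_esc branch).
            // Unknown escape: emit the character itself and drop the
            // backslash (the '\\' >> qi::char_ branch).
            out += (it != c_esc.end()) ? it->second : next;
        } else {
            // Any other character is copied unchanged (the ~qi::char_('"')
            // branch; a dangling final backslash also lands here).
            out += body[i];
        }
    }
    return out;
}
```

For example, decode_body(R"(a\qb)") yields aqb: an escape that is not in the table loses its backslash, exactly as in the rule.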

Demo

Live On Coliru

#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;

int main() {
    using It = std::string::const_iterator;

    using Skipper = qi::space_type;
    qi::rule<It, std::string(), Skipper> qstring;

    qi::symbols<char, char> c_esc;
    c_esc.add("\\\\", '\\')                                                            //
        ("\\a", '\a')("\\b", '\b')("\\n", '\n')("\\f", '\f')("\\t", '\t')("\\r", '\r') //
        ("\\v", '\v')("\\0", '\0')("\\e", 0x1b)("\\'", '\'')("\\\"", '"')("\\?", 0x3f);

    qstring = '"' > *(c_esc | '\\' >> qi::char_ | ~qi::char_('"')) > '"';

    for (std::string input :
         {
             R"("")",
             R"("ab\n\"")",
             R"("ab\r\n\'")",
         }) //
    {
        std::string output;
        bool success = phrase_parse(input.cbegin(), input.cend(), qstring, qi::space, output);

        if (!success)
            std::cout << quoted(input) << " -> FAILED\n";
        else
            std::cout << quoted(input) << " -> " << quoted(output) << "\n";
    }
}

Printing

"\"\"" -> ""
"\"ab\\n\\\"\"" -> "ab
\""
"\"ab\\r\\n\\'\"" -> "ab
'"

Further Reading

For more complete escape handling, see here: Creating a boost::spirit::x3 parser for quoted strings with escape sequence handling (also alternative approaches instead of the symbols).

It contains a list of even more elaborate examples (JSON style unicode escapes etc.)

sehe
  • For completeness, [compare `qi::raw['"' > *('\\' >> qi::char_ | ~qi::char_('"')) > '"']`](http://coliru.stacked-crooked.com/a/f7841d50c5865981) – sehe Jan 06 '23 at 17:57
  • I like that this shows how to do it with a symbol table; I'd actually handled that case with a semantic action to do the replacement. But my issue still stands, and is still visible in this solution: if your string includes an escaped character which isn't in the symbol table, it simply removes the `\`. I want it to preserve that escape character. – Dval Jan 06 '23 at 19:26
  • My answer explains why you _don't_. If you want the semantic action approach, go the `qi::raw` route. – sehe Jan 06 '23 at 19:28
  • This "string" is only partially being parsed; its contents may be passed to a further downstream parser which has its own handling of escaped characters. So, yes, I do want the escape characters to remain. – Dval Jan 06 '23 at 19:45
  • And even more fundamentally for my understanding of boost spirit: if you change `qstring = '"' > *(c_esc | '\\' >> qi::char_ | ~qi::char_('"')) > '"';` to `qstring = '"' > *(c_esc | qi::char_('\\') >> qi::char_ | ~qi::char_('"')) > '"';`, it will not compile – Dval Jan 06 '23 at 19:49
  • #1 semantic action escape artist with `raw[]`: http://coliru.stacked-crooked.com/a/443539b3f8727744 – sehe Jan 06 '23 at 21:04
  • #2 your fundamental understanding of spirit: http://coliru.stacked-crooked.com/a/035c4cb7f4ebc9e0 Note that `fusion::vector` is like `std::tuple`. – sehe Jan 06 '23 at 21:06
  • Thanks for the example with the semantic action which senses the type of the rule! That helps greatly! – Srinath Avadhanula Jan 06 '23 at 21:20
  • 1
    There was another approach here https://stackoverflow.com/questions/9404189/detecting-the-parameter-types-in-a-spirit-semantic-action/9405265#9405265, which is kinda a FAQ entry. However I came up with the newer way to phrase it using c++17, lambdas and also qi::copy which wasn't there at the time. – sehe Jan 06 '23 at 21:41