Creating a boost::spirit::x3 parser for quoted strings with escape sequence handling

Question

I need to create a parser for quoted strings for my custom language that will also properly handle escape sequences, which includes allowing escaped quotes within the string. This is my current string parser:

x3::lexeme[quote > *(x3::char_ - quote) > quote]

where quote is just a constant expression for '"'. It does no escape sequence handling whatsoever. I know about boost::spirit::classic::lex_escape_ch_p, but I've no idea how to use that with the boost::spirit::x3 tools (or in general). How could I create a parser that does this? The parser would have to recognize most escape sequences, from common ones like '\n' and '\t' to more complex ones like hexadecimal, octal, and ANSI escape sequences.

My apologies if there's something wrong with this post, it's my first time posting on SO.

EDIT:

Here is how I ended up implementing the parser:

x3::lexeme[quote > *(
    ("\\\"" >> &x3::char_) >> x3::attr(quote) | ~x3::char_(quote)
    ) > quote]
[handle_escape_sequences];

where handle_escape_sequences is a lambda:

auto handle_escape_sequences = [&](auto&& context) -> void {
    std::string& str = x3::_val(context);

    uint32_t i{};

    // Not static: a static lambda would keep references to the first call's locals.
    auto replace = [&](const char replacement) -> void {
        str[i++] = replacement;
    };

    if (!classic::parse(std::begin(str), std::end(str), *classic::lex_escape_ch_p[replace]).full)
        throw Error{ "invalid literal" }; // invalid escape sequence most likely

    str.resize(i);
};

It does full ANSI escape sequence parsing, which means you can use it to do all sorts of fancy terminal manipulation like setting the text color, cursor position, etc. with it.
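The in-place rewrite in handle_escape_sequences works because a decoded escape is never longer than its source text, so the write index can never overtake the read position. Here is a minimal sketch of that idea in plain C++, without Spirit. The function name unescape_in_place is hypothetical, and it only knows \n, \t, \\ and \" (an assumption made for brevity, not the full lex_escape_ch_p set):

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration, not part of Spirit: collapses the two-character
// escapes \n, \t, \\ and \" inside `str` itself. `write` never overtakes
// `read`, because each escape decodes to fewer characters than it occupies.
// Returns false on an unknown or dangling escape; `str` may then be
// partially rewritten.
bool unescape_in_place(std::string& str) {
    std::size_t write = 0;
    for (std::size_t read = 0; read < str.size(); ++read) {
        char c = str[read];
        if (c == '\\') {
            if (++read == str.size()) return false; // dangling backslash
            switch (str[read]) {
                case 'n':  c = '\n'; break;
                case 't':  c = '\t'; break;
                case '\\': c = '\\'; break;
                case '"':  c = '"';  break;
                default:   return false;            // unknown escape
            }
        }
        str[write++] = c; // overwrite the already-consumed prefix
    }
    str.resize(write);    // drop the leftover tail, like str.resize(i) above
    return true;
}
```

Calling it on a std::string holding `a\tb` (four characters) leaves a three-character string with a real tab in the middle.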

Here's the full definition of the rule as well as all of the stuff it depends on (I just picked everything related to it out of my code so that's why the result looks like proper spaghetti) in case someone happens to need it:

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/classic_utility.hpp>
#include <cstdint>
#include <string>

using namespace boost::spirit;

#define RULE_DECLARATION(rule_name, attribute_type)                            \
inline namespace Tag { class rule_name ## _tag; }                              \
x3::rule<Tag::rule_name ## _tag, attribute_type, true> rule_name = #rule_name; \

#define SIMPLE_RULE_DEFINITION(rule_name, attribute_type, definition) \
RULE_DECLARATION(rule_name, attribute_type)                           \
auto rule_name ## _def = definition;                                  \
BOOST_SPIRIT_DEFINE(rule_name);

constexpr char quote = '"';


template <class Base, class>
struct Access_base_s : Base {
    using Base::Base, Base::operator=;
};

template <class Base, class Tag>
using Unique_alias_for = Access_base_s<Base, Tag>;


using String_literal = Unique_alias_for<std::string, class String_literal_tag>;

SIMPLE_RULE_DEFINITION(string_literal, String_literal,
    x3::lexeme[quote > *(
        ("\\\"" >> &x3::char_) >> x3::attr(quote) | ~x3::char_(quote)
        ) > quote]
    [handle_escape_sequences]
);

sehe · Accepted Answer · 2020-05-09T12:39:14.173

I have many examples of this on this site¹

Let me start with simplifying your expression (~charset is likely more efficient than charset - exceptions):

x3::lexeme['"' > *~x3::char_('"') > '"']

Now, to allow escapes, we can decode them ad hoc:

auto qstring = x3::lexeme['"' > *(
         "\\n" >> x3::attr('\n')
       | "\\b" >> x3::attr('\b')
       | "\\f" >> x3::attr('\f')
       | "\\t" >> x3::attr('\t')
       | "\\v" >> x3::attr('\v')
       | "\\0" >> x3::attr('\0')
       | "\\r" >> x3::attr('\r')
       | "\\"  >> x3::char_("\"\\")
       | ~x3::char_('"')
   ) > '"'];

Alternatively you could use a symbols approach, either including or excluding the slash:

x3::symbols<char> escapes;
escapes.add
    ( "\\n", '\n')
    ( "\\b", '\b')
    ( "\\f", '\f')
    ( "\\t", '\t')
    ( "\\v", '\v')
    ( "\\0", '\0')
    ( "\\r", '\r')
    ( "\\\\", '\\')
    ( "\\\"", '"');

auto qstring = x3::lexeme['"' > *(escapes | ~x3::char_('"')) > '"'];

See it Live On Coliru as well.

I think I prefer the hand-rolled branches, because they give you flexibility to do e.g. hex/octal escapes (mind the conflict with \0 though):

       | "\\" >> x3::int_parser<char, 8, 1, 3>()
       | "\\x" >> x3::int_parser<char, 16, 2, 2>()

Which also works fine:

Live On Coliru

#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <iomanip>

int main() {
    namespace x3 = boost::spirit::x3;

    auto qstring = x3::lexeme['"' > *(
             "\\n" >> x3::attr('\n')
           | "\\b" >> x3::attr('\b')
           | "\\f" >> x3::attr('\f')
           | "\\t" >> x3::attr('\t')
           | "\\v" >> x3::attr('\v')
           | "\\r" >> x3::attr('\r')
           | "\\"  >> x3::char_("\"\\")
           | "\\" >> x3::int_parser<char, 8, 1, 3>()
           | "\\x" >> x3::int_parser<char, 16, 2, 2>()
           | ~x3::char_('"')
       ) > '"'];

    for (std::string const input : { R"("\ttest\x41\x42\x43 \x031\x032\x033 \"hello\"\r\n")" }) {
        std::string output;
        auto f = begin(input), l = end(input);
        if (x3::phrase_parse(f, l, qstring, x3::blank, output)) {
            std::cout << "[" << output << "]\n";
        } else {
            std::cout << "Failed\n";
        }
        if (f != l) {
            std::cout << "Remaining unparsed: " << std::quoted(std::string(f,l)) << "\n";
        }
    }
}

Prints

[   testABC 123 "hello"
]
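To make the branch order concrete, here is a dependency-free sketch that mirrors the alternatives of the program above in plain C++ rather than X3. The name decode_escapes is hypothetical, and rejecting values above 0xFF is an assumption of this sketch (the range check X3 applies for char may differ). There is no separate \0 case: if one were tried before the octal branch, `\012` would decode to a NUL followed by the literal characters `12` instead of a newline.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper, not part of Spirit: decodes the text between the
// quotes, trying the escape forms in the same order as the grammar's
// alternatives. Returns std::nullopt on a malformed escape.
std::optional<std::string> decode_escapes(std::string_view in) {
    auto hex_value = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };

    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; } // plain character
        if (++i == in.size()) return std::nullopt;     // dangling backslash

        switch (in[i]) {                               // simple escapes first
            case 'n': out += '\n'; continue;
            case 'b': out += '\b'; continue;
            case 'f': out += '\f'; continue;
            case 't': out += '\t'; continue;
            case 'v': out += '\v'; continue;
            case 'r': out += '\r'; continue;
            case '"': case '\\': out += in[i]; continue;
            default: break;
        }

        if (in[i] >= '0' && in[i] <= '7') {            // octal, 1 to 3 digits
            int value = 0;
            std::size_t digits = 0;
            while (digits < 3 && i < in.size() && in[i] >= '0' && in[i] <= '7') {
                value = value * 8 + (in[i] - '0');
                ++i;
                ++digits;
            }
            --i;                                       // the for loop advances again
            if (value > 0xFF) return std::nullopt;     // assumption: one byte only
            out += static_cast<char>(value);
            continue;
        }

        if (in[i] == 'x') {                            // hex, exactly 2 digits
            if (i + 2 >= in.size()) return std::nullopt;
            int const hi = hex_value(in[i + 1]);
            int const lo = hex_value(in[i + 2]);
            if (hi < 0 || lo < 0) return std::nullopt;
            out += static_cast<char>(hi * 16 + lo);
            i += 2;
            continue;
        }

        return std::nullopt;                           // unknown escape
    }
    return out;
}
```

With this ordering, the source text `\x1b[91m` decodes to the ESC byte followed by the unchanged characters `[91m`, so an ANSI sequence only needs the hex branch and a pass-through of everything after it.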

¹ Have a look at these

Qi, simple: Replace lit with different string in boost spirit
Qi, complete JSON-style: Handling utf-8 in Boost.Spirit with utf-32 parser
Qi, practical X500/LDAP distinguished names style: How to parse a grammar into a `std::set` using `boost::spirit`?
Qi, practical C-style escapes: boost spirit parsing quote string fails

Thank you very much for your help. From your answers to both of my posts I gained some crucial information. I only just picked up `boost::spirit`, and I think this info might've helped my knowledge reach critical mass to properly start learning about the library. I used a combination of what I asked for in my posts to implement an escape sequence handler without having to hardcode in the sequences by hand. See my edit to the original post if you wanna have a look! — , May 09 '20 at 22:08
That looks funky :) I'm not sure I'd recommend relying on Spirit Classic code. Also, I'm pretty confused how you'd be handling ANSI escapes (I would think you would just pass-through, and I don't see how you translate it into anything else actionable (like HTML or similar?)). But yeah, certainly looks like you got the hang of things. — sehe, May 09 '20 at 23:47
I was also reluctant to use anything from the `classic` namespace, but the `lex_escape_ch_p` thingy was the only thing I could find that does anything close to what I needed. No idea how it does what it does, but it works amazingly. I've been throwing anything I could think of at it, and after a few bug fixes it seems to be able to handle any escape sequence. — , May 10 '20 at 00:34
_"[...] was the only thing I could find"_ Huh. Didn't you ask the question for that reason? The above answer _already_ parses **exactly** what `c_escape_ch_p` does, and with a [very very minor simplification](http://coliru.stacked-crooked.com/a/5162d82bcd02ff81), it does exactly what `lex_escape_ch_p` does ([documentation](https://www.boost.org/doc/libs/1_35_0/libs/spirit/doc/escape_char_parser.html)) — sehe, May 10 '20 at 13:04
If your point is that you rather have it as a character parser instead of a sequence, just drop the `*()`... It's as simple as that: http://coliru.stacked-crooked.com/a/2410b0e3eb707c54 — sehe, May 10 '20 at 13:07
The question was specifically about `spirit::x3`, and I was trying to avoid using the `classic` stuff which is why I still asked. Good thing too, I wouldn't've been able to implement it otherwise. I mainly needed a way to parse all sorts of sequences, like `"\x1b[91m"` (which sets the terminal foreground color to bright red). Or would the approach you suggested be able to do that as well? (Please note that I'm being sincere, this is a real question) — , May 10 '20 at 19:59
Yeah. I appreciate that the question is sincere. I'm just a bit surprised. Isn't it obvious? And did you not try it out? That seems like [a lot less trouble](http://coliru.stacked-crooked.com/a/420250a2c7c7939a) than making the kludge with the semantic action work. Obviously, if you want it to have the ANSI effect on an ANSI capable terminal, just `std::cout << parsed;` instead. Coliru is not ANSI capable, so print the hex to make it obvious that it will work. — sehe, May 10 '20 at 21:18
