
I am toying with Boost.Spirit. As part of a larger work I am trying to construct a grammar for parsing C/C++ style string literals. I encountered a problem:

How do I create a sub-grammar that appends a std::string() result to the calling grammar's std::string() attribute (instead of just a char)?

Here is my code, which is working so far. (Actually I already got much more than that, including stuff like '\n' etc., but I cut it down to the essentials.)

#define BOOST_SPIRIT_UNICODE

#include <iostream>
#include <string>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>

using namespace boost;
using namespace boost::spirit;
using namespace boost::spirit::qi;

template < typename Iterator >
struct EscapedUnicode : grammar< Iterator, char() > // <-- should be std::string
{
    EscapedUnicode() : EscapedUnicode::base_type( escaped_unicode )
    {
        escaped_unicode %= "\\" > ( ( "u" >> uint_parser< char, 16, 4, 4 >() )
                                  | ( "U" >> uint_parser< char, 16, 8, 8 >() ) );
    }

    rule< Iterator, char() > escaped_unicode;  // <-- should be std::string
};

template < typename Iterator >
struct QuotedString : grammar< Iterator, std::string() >
{
    QuotedString() : QuotedString::base_type( quoted_string )
    {
        quoted_string %= '"' >> *( escaped_unicode | ( char_ - ( '"' | eol ) ) ) >> '"';
    }

    EscapedUnicode< Iterator > escaped_unicode;
    rule< Iterator, std::string() > quoted_string;
};

int main()
{
    std::string input = "\"foo\u0041\"";
    typedef std::string::const_iterator iterator_type;
    QuotedString< iterator_type > qs;
    std::string result;
    bool r = parse( input.cbegin(), input.cend(), qs, result );
    std::cout << result << std::endl;
}

This prints fooA -- the QuotedString grammar calls the EscapedUnicode grammar, which results in a char being added to the std::string attribute of QuotedString (the A, 0x41).

But of course I would need to generate a sequence of chars (bytes) for anything beyond 0x7f. EscapedUnicode would need to produce a std::string, which would have to be appended to the string generated by QuotedString.

And that is where I've met a roadblock. I do not understand the things Boost.Spirit does in concert with Boost.Phoenix, and any attempts I have made resulted in lengthy and pretty much undecipherable template-related compiler errors.

So, how can I do this? The answer need not actually do the proper Unicode conversion; it's the std::string issue I need a solution for.

DevSolar
  • "But of course I would need to generate a sequence of chars (bytes) for anything beyond 0x7f." - what encoding do you want? – sehe Aug 29 '15 at 22:24
  • just as a comment: in general when you have questions like "why isn't spirit appending / concatenating / doing what I want with my sequences", have a look at the cheat sheet: http://www.boost.org/doc/libs/1_58_0/libs/spirit/doc/html/spirit/qi/quick_reference/compound_attribute_rules.html – Chris Beck Aug 30 '15 at 00:04

1 Answer


A few points applied:

  • Please avoid blanket using namespace directives in highly generic code. Argument-dependent lookup (ADL) will ruin your day unless you control it.
  • Operator %= is auto-rule assignment, meaning that automatic attribute propagation will be forced even in the presence of semantic actions. You don't want that because the attribute exposed by uint_parser will not be (correctly) automatically propagated if you want to encode into multi-byte string representation.
  • The input string

    std::string input = "\"foo\u0041\"";
    

    needed to be

    std::string input = "\"foo\\u0041\"";
    

    otherwise the compiler does the escape handling before the parser even runs :)
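The difference between the two literals can be checked without Spirit at all. The sketch below is only an illustration; the function names are hypothetical and not part of the question or the answer. It shows that the single-backslash literal already contains the character A, while the double-backslash literal (or a raw string literal) still contains the six characters \u0041 for the parser to consume.

```cpp
#include <string>

// "\u0041" is a universal character name: the compiler replaces it with 'A',
// so no backslash ever reaches the parser.
inline std::string compiler_escaped() { return "\"foo\u0041\""; }

// "\\u0041" keeps a literal backslash followed by "u0041" in the string.
inline std::string parser_escaped() { return "\"foo\\u0041\""; }

// A raw string literal is an alternative way to spell the same input.
inline std::string parser_escaped_raw() { return R"("foo\u0041")"; }
```

Comparing the sizes of the two strings (6 versus 11 characters) is a quick way to confirm which form the parser actually receives.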

Here come the specific tricks for the meat of the task:

  • You will want to change the rule's declared attribute to something that Spirit will automatically "flatten" in simple sequences. E.g.

    quoted_string = '"' >> *(escaped_unicode | (qi::char_ - ('"' | qi::eol))) >> '"';
    

    will not append, because the first branch of the alternative yields a sequence of char and the second a single char. The following spelling of the equivalent:

    quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';
    

    subtly triggers the appending heuristic in Spirit, so we can achieve what we want without involving Semantic Actions.

The rest is straightforward:

  • implement the actual encoding with a Phoenix function object:

    struct encode_f {
        template <typename...> struct result { using type = void; };
    
        template <typename V, typename CP> void operator()(V& a, CP codepoint) const {
            // TODO implement desired encoding (e.g. UTF8)
            bio::stream<bio::back_insert_device<V> > os(a);
            os << "[0x" << std::hex << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0') << codepoint << "]";
        }
    };
    boost::phoenix::function<encode_f> encode;
    

    This you can then use like:

    escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                             | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );
    

    Because you mentioned you don't care about the specific encoding, I elected to encode the raw codepoint in 16-bit or 32-bit hex representation, like [0x0041]. I pragmatically used Boost Iostreams, which is capable of writing directly into the attribute's container type.

  • Use BOOST_SPIRIT_DEBUG* macros
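The bracketed hex form described above can be tried in isolation with plain iostreams. This is a minimal sketch, assuming a std::ostringstream in place of the Boost Iostreams back-inserter; the helper name bracketed_hex is hypothetical. The width is the number of hex digits of the code point type, that is, its bit count divided by 4.

```cpp
#include <cstdint>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Formats a code point as "[0x....]", zero-padded to the hex width of CP
// (4 digits for a 16-bit type, 8 digits for a 32-bit type).
template <typename CP>
std::string bracketed_hex(CP codepoint) {
    std::ostringstream os;
    os << "[0x" << std::hex
       << std::setw(std::numeric_limits<CP>::digits / 4)
       << std::setfill('0')
       // Widen first so that narrow types are always printed as numbers.
       << static_cast<std::uint32_t>(codepoint) << "]";
    return os.str();
}
```

The "0x" prefix is written as literal text rather than via std::showbase, because a base prefix produced by showbase counts towards the field width and would shorten the zero padding.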

Live On Coliru

//#define BOOST_SPIRIT_UNICODE
//#define BOOST_SPIRIT_DEBUG

#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <vector>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

// for demo re-encoding
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/stream.hpp>
#include <iomanip>

namespace qi  = boost::spirit::qi;
namespace bio = boost::iostreams;
namespace phx = boost::phoenix;

template <typename Iterator, typename Attr = std::vector<char> > // or std::string for that matter
struct EscapedUnicode : qi::grammar<Iterator, Attr()>
{
    EscapedUnicode() : EscapedUnicode::base_type(escaped_unicode)
    {
        using namespace qi;

        escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                                 | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );

        BOOST_SPIRIT_DEBUG_NODES((escaped_unicode))
    }

    struct encode_f {
        template <typename...> struct result { using type = void; };

        template <typename V, typename CP> void operator()(V& a, CP codepoint) const {
            // TODO implement desired encoding (e.g. UTF8)
            bio::stream<bio::back_insert_device<V> > os(a);
            os << "[0x" << std::hex << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0') << codepoint << "]";
        }
    };
    boost::phoenix::function<encode_f> encode;

    qi::rule<Iterator, Attr()> escaped_unicode;
};

template <typename Iterator>
struct QuotedString : qi::grammar<Iterator, std::string()>
{
    QuotedString() : QuotedString::base_type(start)
    {
        start = quoted_string;
        quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';
        BOOST_SPIRIT_DEBUG_NODES((start)(quoted_string))
    }

    EscapedUnicode<Iterator> escaped_unicode;
    qi::rule<Iterator, std::string()> start;
    qi::rule<Iterator, std::vector<char>()> quoted_string;
};

int main() {
    std::string input = "\"foo\\u0041\\U00000041\"";

    typedef std::string::const_iterator iterator_type;
    QuotedString<iterator_type> qs;
    std::string result;
    bool r = qi::parse(input.cbegin(), input.cend(), qs, result);
    std::cout << std::boolalpha << r << ": '" << result << "'\n";
}

Prints:

true: 'foo[0x0041][0x00000041]'
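The TODO in encode_f leaves the actual encoding open. If UTF-8 is the target, one possible body is sketched below. The function name append_utf8 and its bool return convention are assumptions made for illustration, not something from the answer; it works on any container with push_back, such as std::string or std::vector<char>.

```cpp
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of code point `cp` to `out`.
// Returns false, appending nothing, for surrogates (U+D800..U+DFFF)
// and for values above U+10FFFF.
template <typename Container>
bool append_utf8(Container& out, std::uint32_t cp) {
    if (cp <= 0x7F) {                                   // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {                          // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate halves are not encodable
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {                        // 4 bytes
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        return false;
    }
    return true;
}
```

Inside encode_f::operator(), a call such as append_utf8(a, codepoint) could then replace the stream output. How invalid code points should be reported to the parser (for example by failing the match) is left open here.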
sehe
  • I *never* use `using namespace`, except when trying to boil a long list of explicit `using boost::spirit::qi::uint_parser` etc. down to as few LoC as possible. ;-) I think I'll need a bit of time pondering the "straightforward" part of your answer (because it's exactly that Phoenix stuff that's still giving me headaches). But it works, so many thanks for your elaborate answer. – DevSolar Aug 30 '15 at 08:54
  • To be honest, I think it is quite possible that noisy errors caused by subtle other things (like `%=`) were ruining the party getting the straightforward part. If you show what you were stuck with instead I could spot the problem if you want – sehe Aug 30 '15 at 10:27
  • In the end it turned out quite differently to what I originally had in mind (and to what you demonstrated here). I posted it at [codereview.SE](http://codereview.stackexchange.com/questions/102374), hoping to iron out the remaining not-so-niceties. – DevSolar Aug 31 '15 at 11:10